Blog Archive


Ahana Will Co-Lead Session At Data & AI Summit About Presto Open Source SQL Query Engine

San Mateo, Calif. – June 23, 2022 – Ahana, the only SaaS for Presto, today announced that Rohan Pednekar, Ahana’s senior product manager, will co-lead a session with Meta Developer Advocate Philip Bell at Data & AI Summit about Presto, the Meta-born open source, high-performance distributed SQL query engine. The event is being held June 27 – 30 in San Francisco, CA and virtually.

Session Title: “Presto 101 – An Introduction to Open Source Presto.”

Session Time: On Demand

Session Presenters: Ahana’s Rohan Pednekar, senior product manager; and Meta Developer Advocate Philip Bell.

Session Details: Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, users can query data in place ad hoc, which shortens both time to discovery and the time it takes to run ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector bring added benefits in performance, scale, and ecosystem.

In this session, Rohan and Philip will introduce the Presto technology and share why it’s becoming so popular. In fact, companies like Facebook, Uber, Twitter, Alibaba, and many others use Presto for interactive ad hoc queries, reporting & dashboarding data lake analytics, and much more. This session will show a quick demo on getting Presto running in AWS.

To register for Data & AI Summit, please go to the event’s registration page to purchase a registration.

TWEET THIS: @AhanaIO to present at #DataAISummit about #Presto https://bit.ly/3n8YDQt #OpenSource #Analytics #Cloud

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


AWS Redshift Query Limits

What is Amazon Redshift?

At its heart, Redshift is Amazon’s petabyte-scale data warehouse product, originally based on PostgreSQL version 8.0.2. It has since evolved into a powerful distributed system that can provide speedy results across millions of rows. Conceptually it is based on node clusters, with a leader node and compute nodes. The leader generates the execution plan for a query and distributes those tasks to the compute nodes. Scalability is achieved with elastic resizing, which can quickly add or modify compute nodes as needed. We’ll discuss the details in the article below.
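
For example, running EXPLAIN against a query shows the distributed plan the leader node builds before handing work to the compute nodes (a minimal sketch; the table and columns are hypothetical):

-- Ask the leader node for the plan it will distribute to the compute nodes
EXPLAIN
SELECT customer_id, SUM(order_total) AS lifetime_value
FROM orders
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 10;

The output lists the scan, aggregation, and data-distribution steps the compute nodes will execute.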

Limitations of Using Amazon Redshift

There are of course Redshift limitations on many parameters, which Amazon refers to as “quotas”. There is a Redshift query limit, a database limit, a Redshift query size limit, and many others. These have default values from Amazon and are per AWS region. Some of these quotas can be increased by submitting an Amazon Redshift Limit Increase Form. Below is a table of some of these quota limitations.

| Quota | Value | Adjustable |
| --- | --- | --- |
| Nodes per cluster | 128 | Yes |
| Nodes per region | 200 | Yes |
| Schemas per DB per cluster | 9,900 | No |
| Tables per node type | 9,900 – 100,000 | No |
| Query limit | 50 | No |
| Databases per cluster | 60 | No |
| Stored procedures per DB | 10,000 | No |
| Query size limit | 100,000 rows | Yes |
| Saved queries | 2,500 | Yes |
| Correlated subqueries | Need to be rewritten | No |

AWS Redshift Performance

To start, Redshift stores data in a compressed, columnar format. This means there is less area on disk to scan and less data that has to be moved around. Add to that sort keys and zone maps and you have the base recipe for high performance. In addition, Redshift maintains a results cache, so frequently executed queries are highly performant. This is aided by the query plan optimization done in the leader node. Redshift also distributes and partitions data across nodes in a highly efficient manner to complement the optimizations in its columnar storage.
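
As a quick illustration of the result cache, repeating an identical query typically returns almost instantly the second time, and you can inspect or disable the cache from SQL (a sketch; the table name is hypothetical):

-- The second execution of an identical query is usually served from the result cache
SELECT vendor_id, COUNT(*) AS trips
FROM taxi_trips
GROUP BY vendor_id;

-- SVL_QLOG.source_query is populated when a result came from the cache
SELECT query, source_query, substring
FROM svl_qlog
ORDER BY query DESC
LIMIT 5;

-- Disable the cache for the current session when benchmarking raw performance
SET enable_result_cache_for_session TO off;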

Scaling

Redshift offers a robust set of scaling strategies. With just a few clicks in the AWS Redshift console, or even with a single API call, you can change node types, add nodes, and pause or resume the cluster. You can also use Elastic Resize to adjust your provisioned capacity within a few minutes. A resize scheduler is available as well, letting you schedule changes, say, for month-end processing. Finally, Concurrency Scaling can automatically provision additional capacity for dynamic workloads.

Pricing

A lot of variables go into Redshift pricing, depending on the scale and features you choose. All of the details and a pricing calculator can be found on the Amazon Redshift Pricing page. As a quick overview, prices start as low as $0.25 per hour and, based on compute time and node size, go up to $13.04 per hour. Amazon provides some incentives to get you started and try out the service.

First, similar to the Ahana Cloud Community Edition, Redshift has a “Free Tier”: if your company has never created a Redshift cluster, you are eligible for a two-month trial of a DC2.Large node. The trial provides 750 hours per month for free, enough to run that DC2 node continuously (a 31-day month has 744 hours) with 160GB of compressed SSD storage. Once your trial expires or your usage exceeds 750 hours per month, you can either keep the cluster running at “on-demand” pricing or shut it down.

Next, there is a $500 credit available for the Amazon Redshift Serverless option if you have never used it before. The credit applies to both compute and storage; how long it lasts depends entirely on the compute capacity you select and your usage.

Then there is “on-demand” pricing. This option lets you pay for provisioned capacity by the hour with no commitments or upfront costs; partial hours are billed in one-second increments. Amazon allows you to pause and resume these nodes when you aren’t using them, so you stop paying for compute while preserving what you have; while a cluster is paused you pay only for backup storage.

Summary

Redshift provides a robust, scalable environment that is well suited to managing data in a data warehouse. Amazon provides a variety of ways to easily give Redshift a try without getting too tied in. Not all analytic workloads make sense in a data warehouse, however, and if you are already landing data into AWS S3, then you have the makings of a data lakehouse that can offer better price/performance. A managed Presto service, such as Ahana, can be the answer to that challenge.


Ahana Announces Additional $7.2 Million Funding Led by Liberty Global Ventures and Debuts Free Community Edition of Ahana Cloud for Presto for the Open Data Lakehouse

Only SaaS for Presto now available for free with Ahana Community Edition; Additional capital raise validates growth of the Open Data Lakehouse market

San Mateo, Calif. – June 16, 2022 – Ahana, the only Software as a Service for Presto, today announced an additional investment of $7.2 million from Liberty Global Ventures with participation from existing investor GV, extending the company’s Series A financing to $27.2 million. Liberty Global is a world leader in converged broadband, video and mobile communications services. This brings the total amount of funding raised to date to $32 million. Ankur Prakash, Partner, Liberty Global Ventures, will join the Ahana Board of Directors as a board observer. Ahana will use the funding to continue to grow its technical team and product development; evangelize the Presto community; and develop go-to-market programs to meet customer demand.

Ahana also announced today Ahana Cloud for Presto Community Edition, designed to simplify the deployment, management and integration of Presto, an open source distributed SQL query engine, for the Open Data Lakehouse. Ahana Community Edition is immediately available to everyone, including users of the 100,000+ downloads of Ahana’s PrestoDB Sandbox on DockerHub. It provides simple, distributed Presto cluster provisioning and tuned out-of-the-box configurations, bringing the power of Presto to data teams of all sizes for free. Instead of downloading and installing open source Presto software, data teams can quickly learn about Presto and deploy initial SQL data lakehouse use cases in the cloud. Community Edition users can easily upgrade to the full version of Ahana Cloud for Presto, which adds increased security including integration with Apache Ranger and AWS Lake Formation, price-performance benefits including multi-level caching, and enterprise-level support.

“Over the past year we’ve focused on bringing the easiest managed service for Presto to market, and today we’re thrilled to announce a forever-free community edition to drive more adoption of Presto across the broader open source user community. Our belief in Presto as the best SQL query engine for the Open Data Lakehouse is underscored by our new relationship with Liberty Global,” said Steven Mih, Cofounder and CEO, Ahana. “With the Community Edition, data platform teams get unlimited production use of Presto at a good amount of scale for lightning-fast insights on their data.”

“Today we’re seeing more companies embrace cloud-based technologies to deliver superior customer experiences. An underlying architectural pattern is the leveraging of an Open Data Lakehouse, a more flexible stack that solves for the high costs, lock-in, and limitations of the traditional data warehouse,” said Ankur Prakash, Partner, Liberty Global Ventures. “Ahana has innovated to address these challenges with its industry-leading approach to bring the most high-performing, cost-effective SQL query engine to data platforms teams. Our investment in Ahana reflects our commitment to drive more value for businesses, specifically in the next evolution of the data warehouse to Open Data Lakehouses.” 

Details of Ahana Cloud for Presto Community Edition include:

●        Free to use, forever

●        Use of Presto in an Open Data Lakehouse with open file formats like Apache Parquet and advanced lake data management like Apache Hudi

●        A single Presto cluster with all supported instance types except Graviton

●        Pre-configured integrations to multiple data sources including the Hive Metastore for Amazon S3, Amazon OpenSearch, Amazon RDS for MySQL, Amazon RDS for PostgreSQL, and Amazon Redshift

●        Community support through public Ahana Community Slack channel plus a free 45 minute onboarding session with an Ahana Presto engineer

●        Seamless upgrade to the full version which includes enterprise features like data access control, autoscaling, multi-level caching, and SLA-based support

“Enterprises continue to embrace ‘lake house’ platforms that apply SQL structures and querying capabilities to cloud-native object stores,” said Kevin Petrie, VP of Research, Eckerson Group. “Ahana’s new Community Edition for Presto offers a SQL query engine that can help advance market adoption of the lake house.”

Supporting Resources:

Get Started with the Ahana Community Edition

Join the Ahana Community Slack Channel

Tweet this:  @AhanaIO announces additional $7.2 million Series A financing led by Liberty Global Ventures; debuts free community edition of Ahana #Cloud for #Presto on #AWS https://bit.ly/3xlAVW4 

About Ahana

Ahana is the only SaaS for Presto on AWS with the vision to be the SQL engine for the Open Data Lakehouse. Presto, the open source project created by Meta and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, Third Point Ventures, and Liberty Global Ventures. Follow Ahana on LinkedIn, Twitter and Presto Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Will Co-Lead Session At Open Source Summit About Presto SQL Query Engine

San Mateo, Calif. – June 14, 2022 — Ahana, the only SaaS for Presto, today announced that Rohan Pednekar, Ahana’s senior product manager, will co-lead a session with Meta Developer Advocate Philip Bell at the Linux Foundation’s Open Source Summit about Presto, the Meta-born open source high performance, distributed SQL query engine. The event is being held June 20 – 24 in Austin, TX and virtual.

Session Title: “Introduction to Presto – The SQL Engine for Data Platform Teams.”

Session Time: Tuesday, June 21 at 11:10am – 11:50am CT

Session Presenters: Ahana’s Rohan Pednekar, senior product manager; and Meta Developer Advocate Philip Bell.

Session Details: Presto is an open-source high performance, distributed SQL query engine. Born at Facebook in 2012, Presto was built to run interactive queries on large Hadoop-based clusters. Today it has grown to support many users and use cases including ad hoc query, data lake analytics, and federated querying. In this session, we will give an overview of Presto including architecture and how it works, the problems it solves, and most common use cases. We’ll also share the latest innovation in the project as well as what’s on the roadmap.

To register for Open Source Summit, please go to the event’s registration page to purchase a registration.

TWEET THIS: @Ahana to present at #OpenSourceSummit about #Presto https://bit.ly/3xMGQ7M #OpenSource #Analytics #Cloud 

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Announcing the Cube integration with Ahana: Querying multiple data sources with managed Presto and Cube

See how Ahana and Cube work together to help you set up a Presto cluster and build a single source of truth for metrics without spending days reading cryptic docs

Ahana provides managed Presto clusters running in your AWS account.

Presto is an open-source distributed SQL query engine, originally developed at Facebook, now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.

Presto implements data federation: you can process data from multiple sources as if they were stored in a single database. Because of that, you don’t need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single point of access for multiple databases (or file systems) requires Ops skills and additional effort.
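
For example, a single Presto query can join a table backed by files in S3 with a table in PostgreSQL (a sketch with hypothetical catalog, schema, and table names):

SELECT o.order_id, o.order_total, c.customer_name
FROM s3.sales.orders o              -- table over files in S3
JOIN postgresql.public.customers c  -- table in PostgreSQL
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2022-01-01';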

Few data engineers want to do that Ops work. Using Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana takes on the burden of infrastructure management and allows you to focus on processing your data.

What is Cube?

Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including Presto, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.

This article will demonstrate the caching functionality, access control, and flexibility of the data retrieval API.

Integration

Cube’s battle-tested Presto driver provides out-of-the-box connectivity to Ahana.

You just need to provide the credentials: the Presto host name and port, user name and password, and the Presto catalog and schema. You’ll also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.

Check the docs to learn more about connecting Cube to Ahana.

Example: Parsing logs from multiple data sources with Ahana and Cube

Let’s build a real-world data application with Ahana and Cube.

We will use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with the data retrieved from a PostgreSQL database.

Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you would like to demonstrate the improvements to the clients.

Because of that, you do a canary deployment of the new versions and gather the predictions from both the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. Additionally, you track the actual production values provided by your clients.

You need all of that to prepare personalized dashboards showing the results of your hard work.

Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.

You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.

Architecture diagram

In the end, you want a dashboard looking like this:

The final result: two dashboards showing the number of errors made by two variants of the ML model

How to configure Ahana?

Allowing Ahana to access your AWS account

First, let’s log in to Ahana and connect it to your AWS account. We must create an IAM role that allows Ahana to access the account.

On the setup page, click the “Open CloudFormation” button. We get redirected to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.

When the IAM role is configured, click the stack’s Outputs tab and copy the AhanaCloudProvisioningRole key value.

The Outputs tab containing the identifier of the IAM role for Ahana

We have to paste it into the Role ARN field on the Ahana setup page and click the “Complete Setup” button.

The Ahana setup page

Creating an Ahana cluster

After configuring AWS access, we have to start a new Ahana cluster.

In the Ahana dashboard, click the “Create new cluster” button.

Ahana create new cluster

In the setup window, we can configure the type of the AWS EC2 instances used by the cluster, scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, look at the “Create new cluster” section of the Ahana documentation.

Ahana cluster setup page

Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the “Create cluster” button. Ahana needs around 20-30 minutes to set up a new cluster.

Retrieving data from S3 and PostgreSQL

After deploying a Presto cluster, we have to connect our data sources to it. In this example, the Sagemaker Endpoint logs are stored in S3 and the actual values in PostgreSQL.

Adding a PostgreSQL database to Ahana

In the Ahana dashboard, click the “Add new data source” button. We will see a page showing all supported data sources. Let’s click the “Amazon RDS for PostgreSQL” option.

In the setup form displayed below, we have to provide the database configuration and click the “Add data source” button.

PostgreSQL data source configuration

Adding an S3 bucket to Ahana

AWS Sagemaker Endpoint stores their logs in an S3 bucket as JSON files. To access those files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.

We have to log in to the AWS console, open the AWS Glue page, and add a new database to the data catalog (or use an existing one).

AWS Glue databases

Now, let’s add a new table. We won’t configure it manually. Instead, let’s create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the “Crawlers” link and click the “Add crawler” button.

AWS Glue crawlers

After typing the crawler’s name and clicking the “Next” button, we will see the Source Type page. On this page, we have to choose “Data stores” and “Crawl all folders” (in our case, “Crawl new folders only” would work too).

Here we specify where the crawler should look for new data

On the “Data store” page, we pick the S3 data store, select the S3 connection (or click the “Add connection” button if we don’t have an S3 connection configured yet), and specify the S3 path.

Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use those parts of the key as table partitions.

Because of that, if our Sagemaker logs have an S3 key: s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!

IAM role configuration

Let’s click the “Next” button. In the subsequent window, we choose “No” when asked whether we want to configure another data source. Glue setup will ask about the name of the crawler’s IAM role. We can create a new one:

Next, we configure the crawler’s schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:

configuring the crawler's schedule

In the output configuration, we need to customize the settings.

First, let’s select the Glue database where the new tables get stored. After that, we modify the “Configuration options.”

We pick “Add new columns only” because we will make manual changes to the table definition, and we don’t want the crawler to overwrite them. Also, we want to add new partitions to the table, so we check the “Update all new and existing partitions with metadata from the table” box.

Crawler's output configuration

Let’s click “Next.” We can check the configuration one more time in the review window and click the “Finish” button.

Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.

AWS Glue tables

In the table view, we click the “Edit table” button and change the “Serde serialization lib” to “org.apache.hive.hcatalog.data.JsonSerDe” because the AWS JSON serialization library isn’t available in the Ahana Presto cluster.

JSON serialization configured in the table details view

We should also click the “Edit schema” button and change the default partition names to values shown in the screenshot below:

Default partition names replaced with their actual names

After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.

Configuring data sources in the Presto cluster

Go back to the Ahana dashboard and click the “Add data source” button. Select the “AWS Glue Data Catalog for Amazon S3” option in the setup form.

AWS Glue data catalog setup in Ahana

Let’s select our AWS region and put the AWS account id in the “Glue Data Catalog ID” field. After that, we click the “Open CloudFormation” button and apply the template. We will have to wait until CloudFormation creates the IAM role.

When the role is ready, we copy the role ARN from the Outputs tab and paste it into the “Glue/S3 Role ARN” field:

The "Outputs" tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana
The “Outputs” tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana

On the Ahana setup page, we click the “Add data source” button.

Adding data sources to an existing cluster

Finally, we can add both data sources to our Ahana cluster.

We have to open the Ahana “Clusters” page, click the “Manage” button, and scroll down to the “Data Sources” section. In this section, we click the “Manage data sources” button.

We will see another setup page where we check the boxes next to the data sources we want to configure and click the “Modify cluster” button. We will need to confirm that we want to restart the cluster to make the changes.

Adding data sources to an Ahana cluster

Writing the Presto queries

The actual structure of the input and output of an AWS Sagemaker Endpoint is up to us: we can send any JSON request and return a custom JSON object.

Let’s assume that our endpoint receives a request containing the input data for the machine learning model and a correlation id. We will need those ids to join the model predictions with the actual data.

Example input:

{"time_series": [51, 37, …, 7], "correlation_id": "cf8b7b9a-6b8a-45fe-9814-11a4b17c710a"}

In the response, the model returns a JSON object with a single “prediction” key and a decimal value:

{"prediction": 21.266147618448954}

A single request in Sagemaker Endpoint logs looks like this:

{"captureData": {"endpointInput": {"observedContentType": "application/json", "mode": "INPUT", "data": "eyJ0aW1lX3NlcmllcyI6IFs1MS40MjM5MjAzODYxNTAzODUsIDM3LjUwOTk2ODc2MTYwNzM0LCAzNi41NTk4MzI2OTQ0NjAwNTYsIDY0LjAyMTU3MzEyNjYyNDg0LCA2MC4zMjkwMzU2MDgyMjIwODUsIDIyLjk1MDg0MjgxNDg4MzExLCA0NC45MjQxNTU5MTE1MTQyOCwgMzkuMDM1NzA4Mjg4ODc2ODA1LCAyMC44NzQ0Njk2OTM0MzAxMTUsIDQ3Ljc4MzY3MDQ3MjI2MDI1NSwgMzcuNTgxMDYzNzUyNjY5NTE1LCA1OC4xMTc2MzQ5NjE5NDM4OCwgMzYuODgwNzExNTAyNDIxMywgMzkuNzE1Mjg4NTM5NzY5ODksIDUxLjkxMDYxODYyNzg0ODYyLCA0OS40Mzk4MjQwMTQ0NDM2OCwgNDIuODM5OTA5MDIxMDkwMzksIDI3LjYwOTU0MTY5MDYyNzkzLCAzOS44MDczNzU1NDQwODYyOCwgMzUuMTA2OTQ4MzI5NjQwOF0sICJjb3JyZWxhdGlvbl9pZCI6ICJjZjhiN2I5YS02YjhhLTQ1ZmUtOTgxNC0xMWE0YjE3YzcxMGEifQ==", "encoding": "BASE64"}, "endpointOutput": {"observedContentType": "application/json", "mode": "OUTPUT", "data": "eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==", "encoding": "BASE64"}}, "eventMetadata": {"eventId": "b409a948-fbc7-4fa6-8544-c7e85d1b7e21", "inferenceTime": "2022-05-06T10:23:19Z"}

AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:

with sagemaker as (
  select
  model_name,
  variant_name,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
  from s3.sagemaker_logs.logs
)
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)

After that, we join both data sources and calculate the absolute error value:

, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),

Now, we need to calculate the percentiles using the `approx_percentile` function. Note that we group the percentiles by model name and model variant. Because of that, Presto will produce only a single row per model-variant pair. That’ll be important when we write the second part of this query.

percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)

In the final part of the query, we use the FILTER expression to count the number of values within each bucket. Additionally, we return the bucket boundaries. We need to use an aggregate function such as max (or any other aggregate function) because of the group by clause. That won’t affect the result because the previous step produced a single row per model-variant pair.

SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant

How to configure Cube?

In our application, we want to display the distribution of absolute prediction errors.

We will have a chart showing the difference between the actual value and the model’s prediction. Our chart will split the absolute errors into buckets (percentiles) and display the number of errors within every bucket.

If the new variant of the model performs better than the existing model, we should see fewer large errors in the charts. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the “0” label.

At the beginning of the article, we looked at an example chart that shows no significant difference between both model variants:

Example chart: both models perform almost the same

If variant B were better than variant A, its chart could look like this (note the axis values in both pictures):

Example chart: an improved second version of the model

Creating a Cube deployment

Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready to use Cube cluster. However, if you prefer self-hosting, then follow this tutorial.

To create a new Cube Cloud deployment, open the “Deployments” page and click the “Create deployment” button.

Cube Deployments dashboard page

We choose the Presto cluster:

Database connections supported by Cube

Finally, we fill out the connection parameters and click the “Apply” button. Remember to enable the SSL connection!

Presto configuration page

Defining the data model in Cube

We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve query results.

Let’s open the Schema view in Cube and add a new file.

The schema view in Cube showing where we should click to create a new file

In the next window, type the file name errorpercentiles.js and click “Create file.”

In the following paragraphs, we will explain parts of the configuration and show you code fragments to copy-paste. You don’t have to do that in such small steps!

Below, you see the entire content of the file. Later, we explain the configuration parameters.

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

const measures = Object.keys(measureNames).reduce((result, name) => {
  const sqlName = measureNames[name];
  return {
    ...result,
    [sqlName]: {
      sql: () => sqlName,
      type: `max`
    }
  };
}, {});

cube('errorpercentiles', {
  sql: `with sagemaker as (
    select
    model_name,
    variant_name,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
    from s3.sagemaker_logs.logs
  )
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)
, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant`,

preAggregations: {
// Pre-Aggregations definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},

joins: {
},

measures: measures,
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
});

In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.

A newly created cube configuration file

We will group and filter the values by the model and variant names, so we put those columns in the dimensions section of the cube configuration. The rest of the columns are going to be our measurements. We can write them out one by one like this:


measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_20: {
    sql: `perc_20`,
    type: `max`
  },
  perc_30: {
    sql: `perc_30`,
    type: `max`
  },
  perc_40: {
    sql: `perc_40`,
    type: `max`
  },
  perc_50: {
    sql: `perc_50`,
    type: `max`
  },
  perc_60: {
    sql: `perc_60`,
    type: `max`
  },
  perc_70: {
    sql: `perc_70`,
    type: `max`
  },
  perc_80: {
    sql: `perc_80`,
    type: `max`
  },
  perc_90: {
    sql: `perc_90`,
    type: `max`
  },
  perc_100: {
    sql: `perc_100`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  perc_20_value: {
    sql: `perc_20_value`,
    type: `max`
  },
  perc_30_value: {
    sql: `perc_30_value`,
    type: `max`
  },
  perc_40_value: {
    sql: `perc_40_value`,
    type: `max`
  },
  perc_50_value: {
    sql: `perc_50_value`,
    type: `max`
  },
  perc_60_value: {
    sql: `perc_60_value`,
    type: `max`
  },
  perc_70_value: {
    sql: `perc_70_value`,
    type: `max`
  },
  perc_80_value: {
    sql: `perc_80_value`,
    type: `max`
  },
  perc_90_value: {
    sql: `perc_90_value`,
    type: `max`
  },
  perc_100_value: {
    sql: `perc_100_value`,
    type: `max`
  }
},
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
A part of the error percentiles configuration in Cube

The notation we have shown you has lots of repetition and is quite verbose. We can shorten the measurements defined in the code by using JavaScript to generate them.

Note that the following code must be placed before the call to the cube function!

First, we have to create an array of column names:


const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

Now, we must generate the measures configuration object. We iterate over the array and create a measure configuration for every column:


const measures = Object.keys(measureNames).reduce((result, name) => {
  const sqlName = measureNames[name];
  return {
    ...result,
    [sqlName]: {
      sql: () => sqlName,
      type: `max`
    }
  };
}, {});

Finally, we can replace the measure definitions with:

measures: measures

After changing the file content, click the “Save All” button.

The top section of the schema view

And click the Continue button in the popup window.

The popup window shows the URL of the test API

In the Playground view, we can test our query by retrieving the chart data as a table (or one of the built-in charts):

An example result in the Playground view

Configuring access control in Cube

In the Schema view, open the cube.js file.

We will use the queryRewrite configuration option to allow or disallow access to data.

First, we will reject all API calls without the models field in the securityContext. We will put the identifiers of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token variables.

For example, we can send a JWT token with the following payload. Of course, in the application sending queries to Cube, we must check the user’s access right and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don’t forget about them.
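
For illustration, a hypothetical payload granting access to two models (the identifiers are whatever your authentication service assigns) might look like this:

{
  "models": ["model_a", "model_b"]
}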

The Security Context window in the Playground view

After rejecting unauthorized access, we add a filter to all queries.

We can tell which dataset a query accesses by looking at the members specified in the query. We need to do it because we must filter by the modelName property of the correct cube.

In our queryRewrite configuration in the cube.js file, we use the query.filters.push function to add a modelName IN (model_1, model_2, ...) clause to the SQL query:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'errorpercentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};

Configuring caching in Cube

By default, Cube caches all Presto queries for 2 minutes. Even though Sagemaker Endpoints store logs in S3 in near real-time, we aren’t interested in refreshing the data so often. Sagemaker Endpoints store the logs in JSON files, so retrieving the metrics requires a full scan of all files in the S3 bucket.

When we gather logs over a long time, the query may take some time. Below, we will show you how to configure the caching in Cube. We recommend doing it when the end-user application needs over one second to load the data.

For the sake of the example, we will retrieve the value only twice a day.

Preparing data sources for caching

First, we must allow Presto to store data in both PostgreSQL and S3. It’s required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading them into Cube.

In PostgreSQL, we grant permissions to the user account used by Presto to access the database:

GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;

If we haven’t modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn’t contain the target S3 location yet, so all requests will fail.

We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.
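
If you prefer SQL over the console, roughly the same thing can be done from the Presto cluster itself (a sketch under assumptions: the catalog name matches our Glue/S3 data source, and the bucket name is hypothetical):

-- Creates a Glue database backed by an S3 location via the Hive/Glue connector
CREATE SCHEMA IF NOT EXISTS s3.prod_pre_aggregations
WITH (location = 's3://your-bucket/prod_pre_aggregations/');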

If you want to use a different database name, follow the instructions in our documentation.

Caching configuration in Cube

Let’s open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:

preAggregations: {
  cacheResults: {
    type: `rollup`,
    measures: [
      errorpercentiles.perc_10, errorpercentiles.perc_10_value,
      errorpercentiles.perc_20, errorpercentiles.perc_20_value,
      errorpercentiles.perc_30, errorpercentiles.perc_30_value,
      errorpercentiles.perc_40, errorpercentiles.perc_40_value,
      errorpercentiles.perc_50, errorpercentiles.perc_50_value,
      errorpercentiles.perc_60, errorpercentiles.perc_60_value,
      errorpercentiles.perc_70, errorpercentiles.perc_70_value,
      errorpercentiles.perc_80, errorpercentiles.perc_80_value,
      errorpercentiles.perc_90, errorpercentiles.perc_90_value,
      errorpercentiles.perc_100, errorpercentiles.perc_100_value
    ],
    dimensions: [errorpercentiles.modelName, errorpercentiles.modelVariant],
    refreshKey: {
      every: `12 hour`,
    },
  },
},

After testing the development version, we can also deploy the changes to production using the “Commit & Push” button. When we click it, we will be asked to type the commit message:

An empty “Commit Changes & Push” view

When we commit the changes, the deployment of a new version of the endpoint will start. A few minutes later, we can start sending queries to the endpoint.

We can also check the pre-aggregations window to verify whether Cube successfully created the cached data.

Successfully cached pre-aggregations

Now, we can move to the Playground tab and run our query. We should see the “Query was accelerated with pre-aggregation” message if Cube used the cached values to handle the request.

The message that indicates that our pre-aggregation works correctly

Building the front-end application

Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.

Our dashboard consists of two parts: the website and the back-end service. In the web part, we will have only the code required to display the charts. In the back-end, we will handle authentication and authorization. The backend service will also send requests to the Cube REST API.

Getting the Cube API key and the API URL

Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud’s web UI and click the “Env vars” tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to the CUBEJS_API_SECRET and copy the value.

The Env vars tab on the settings page

We also need the URL of the Cube endpoint. To get this value, click the “Copy API URL” link in the top right corner of the screen.

The location of the Copy API URL link

Back end for front end

Now, we can write the back-end code.

First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect those model names in an array stored in the allowedModels variable.

After getting the user’s credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.

const jwt = require('jsonwebtoken');
CUBE_URL = '';
function create_cube_token() {
  const CUBE_API_SECRET = your_token; // Don’t store it in the code!!!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system

  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );
  
  return cubejsToken;
}

We will need two endpoints in our back-end service: the endpoint returning the chart data and the endpoint retrieving the names of models and variants we can access.

We create a new express application running in the node server and configure the /models endpoint:

const request = require('request');
const express = require('express')
const bodyParser = require('body-parser')
const port = 5000;
const app = express()

app.use(bodyParser.json())
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`)
})

In the getAvailableModels function, we query the Cube Cloud API to get the model names and variants. It will return only the models we are allowed to see because we have configured the Cube security context.

Our function returns a list of objects containing the modelName and modelVariant fields:

function getAvailableModels(req, res) {
  res.setHeader('Content-Type', 'application/json');
  request.post(CUBE_URL + '/load', {
    headers: {
      'Authorization': create_cube_token(),
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({"query": {
      "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
      ],
      "timeDimensions": [],
      "order": {
        "errorpercentiles.modelName": "asc"
      }
    }})
  }, (err, res_, body) => {
    if (err) {
      console.log(err);
    }
    body = JSON.parse(body);
    response = body.data.map(item => {
      return {
        modelName: item["errorpercentiles.modelName"],
        modelVariant: item["errorpercentiles.modelVariant"]
      }
    });
    res.send(JSON.stringify(response));
  });
};

Let’s retrieve the percentiles and percentile buckets. To simplify the example, we will show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.

The query specifies all measures we want to retrieve and sets the filter to get data belonging to a single model’s variant. We could retrieve all data at once, but we do it one by one for every variant.

{
  "query": {
    "measures": [
      "errorpercentiles.perc_10",
      "errorpercentiles.perc_20",
      "errorpercentiles.perc_30",
      "errorpercentiles.perc_40",
      "errorpercentiles.perc_50",
      "errorpercentiles.perc_60",
      "errorpercentiles.perc_70",
      "errorpercentiles.perc_80",
      "errorpercentiles.perc_90",
      "errorpercentiles.perc_100",
      "errorpercentiles.perc_10_value",
      "errorpercentiles.perc_20_value",
      "errorpercentiles.perc_30_value",
      "errorpercentiles.perc_40_value",
      "errorpercentiles.perc_50_value",
      "errorpercentiles.perc_60_value",
      "errorpercentiles.perc_70_value",
      "errorpercentiles.perc_80_value",
      "errorpercentiles.perc_90_value",
      "errorpercentiles.perc_100_value"
    ],
    "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
    ],
    "filters": [
      {
        "member": "errorpercentiles.modelName",
        "operator": "equals",
        "values": [
          req.query.model
        ]
      },
      {
        "member": "errorpercentiles.modelVariant",
        "operator": "equals",
        "values": [
          req.query.variant
        ]
      }
    ]
  }
}

The response parsing code extracts the number of values in every bucket and prepares bucket labels:

response = body.data.map(item => {
  return {
    modelName: item["errorpercentiles.modelName"],
    modelVariant: item["errorpercentiles.modelVariant"],
    labels: [
      "<=" + item['percentiles.perc_10_value'],
      item['errorpercentiles.perc_20_value'],
      item['errorpercentiles.perc_30_value'],
      item['errorpercentiles.perc_40_value'],
      item['errorpercentiles.perc_50_value'],
      item['errorpercentiles.perc_60_value'],
      item['errorpercentiles.perc_70_value'],
      item['errorpercentiles.perc_80_value'],
      item['errorpercentiles.perc_90_value'],
      ">=" + item['errorpercentiles.perc_100_value']
    ],
    values: [
      item['errorpercentiles.perc_10'],
      item['errorpercentiles.perc_20'],
      item['errorpercentiles.perc_30'],
      item['errorpercentiles.perc_40'],
      item['errorpercentiles.perc_50'],
      item['errorpercentiles.perc_60'],
      item['errorpercentiles.perc_70'],
      item['errorpercentiles.perc_80'],
      item['errorpercentiles.perc_90'],
      item['errorpercentiles.perc_100']
    ]
  }
})

Dashboard website

In the last step, we build the dashboard website using Vue.js.

If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.

We define the main Vue component encapsulating the entire website content. In the script section, we will download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for all of them.

We put the charts in the Suspense component to allow asynchronous loading.

To keep the example short, we will skip the CSS style part.

<script setup>
  import OwnerName from './components/OwnerName.vue'
  import ChartView from './components/ChartView.vue'
  import axios from 'axios'
  import { ref } from 'vue'
  const models = ref([]);
  const SERVER_URL = ''; // placeholder: URL of the back-end service built in the previous section
  axios.get(SERVER_URL + '/models').then(response => {
    models.value = response.data
  });
</script>

<template>
  <header>
    <div class="wrapper">
      <OwnerName name="Test Inc." />
    </div>
  </header>
  <main>
    <div v-for="model in models" v-bind:key="model.modelName">
      <Suspense>
        <ChartView v-bind:title="model.modelName" v-bind:variant="model.modelVariant" type="percentiles"/>
      </Suspense>
    </div>
  </main>
</template>

The OwnerName component displays our client’s name. We will skip its code as it’s irrelevant in our example.

In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:

import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);

We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain those properties:

const props = defineProps({
  title: String,
  variant: String,
  type: String
})

Next, we retrieve the chart data and labels from the back-end service. We will also prepare the variable containing the label text:

const response = await axios.get(SERVER_URL + '/' + props.type + '?model=' + props.title + '&variant=' + props.variant)
const data = response.data[0].values;
const labels = response.data[0].labels;
const label_text = "Number of prediction errors of a given value"

Finally, we prepare the chart configuration variables:

const chartData = ref({
  labels: labels,
  datasets: [
    {
      label: label_text,
      backgroundColor: '#f87979',
      data: data
    }
  ],
});

const chartOptions = {
  plugins: {
    title: {
      display: true,
      text: props.title + ' - ' + props.variant,
    },
    // Chart.js v3 expects the legend and tooltip options under `plugins`
    legend: {
      display: false
    },
    tooltip: {
      enabled: false
    }
  }
}

In the template section of the Vue component, we pass the configuration to the Bar instance:

<template>
  <Bar ref="chart" v-bind:chart-data="chartData" v-bind:chart-options="chartOptions" />
</template>

If we have done everything correctly, we should see a dashboard page with error distributions.

Charts displaying the error distribution for different model variants

Wrapping up

Thanks for following this tutorial.

We encourage you to spend some time reading the Cube and Ahana documentation.

Please don’t hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube’s Slack community, and subscribe to the Ahana newsletter.

Athena Limitations & AWS Athena Limits | Comparing AWS Athena

Welcome to our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service. We hope you find this series helpful.

AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. It has a serverless architecture, and Athena users pay per query (it’s priced at $5 per terabyte scanned, so a query that scans 200 GB of data costs about $1). Common Amazon Athena limits include query limits, concurrent-query limits, and partition limits. These limits can constrain performance and drive up operational costs. In addition, AWS Athena is built on an old version of PrestoDB and only supports a subset of PrestoDB features.

An overview on AWS Athena limits

AWS Athena query limits can cause problems, and many data engineering teams have spent hours trying to diagnose them. Some limits are hard, while some are soft quotas that you can ask AWS to increase. One big limitation concerns query concurrency: by default, Athena users can only submit one query at a time and run up to five queries simultaneously per account.

AWS Athena query limits

AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like DELETE and INSERT) have the following limits: 

1.    Athena DDL max query limit: 20 active DDL queries.

2.    Athena DDL query timeout limit: the Athena DDL query timeout is 600 minutes.

3.    Athena DML query limit: by default, Athena only allows 25 DML queries (running and queued) in the US East Region and 20 DML queries in all other Regions.

4.    Athena DML query timeout limit: the Athena DML query timeout limit is 30 minutes.

5.    Athena query string length limit: the Athena query string hard limit is 262,144 bytes.

Learn More About Athena Query Limits

We have put together a deep dive into Athena Query limits in Part 2 of this series, which you can read by following the link below:

AWS Athena partition limits

  1. Athena users can use AWS Glue, a data catalog and ETL service. Athena's partition limit is 20,000 partitions per table, while Glue's limit is 1,000,000 partitions per table.
  2. A Create Table As Select (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation, you must manually chop up your data by running a series of INSERT INTO statements that each insert up to 100 partitions, as shown in the sketch after this list.
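
To make the workaround concrete, here is a minimal sketch assuming a hypothetical source table events_raw partitioned by event_date; the table, column, and bucket names are illustrative, not taken from Athena's documentation:

-- Hypothetical CTAS: creates the destination table, but may write at most 100 partitions
CREATE TABLE events_curated
WITH (
  format = 'PARQUET',
  external_location = 's3://example-bucket/events_curated/',
  partitioned_by = ARRAY['event_date']
) AS
SELECT user_id, event_type, event_date
FROM events_raw
WHERE event_date BETWEEN DATE '2022-01-01' AND DATE '2022-04-10';  -- 100 days = 100 partitions

-- Each follow-up INSERT INTO may add up to 100 more partitions
INSERT INTO events_curated
SELECT user_id, event_type, event_date
FROM events_raw
WHERE event_date BETWEEN DATE '2022-04-11' AND DATE '2022-07-19';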

Athena database limits

AWS Athena also has the following S3 bucket limitations: 

1.    Amazon S3 bucket limit is 100* buckets per account by default – you can request to increase it up to 1,000 S3 buckets per account.

2.    Athena restricts each account to 100* databases, and databases cannot include over 100* tables.

*Note: Athena has recently increased this to 10K databases per account and 200K tables per database.

AWS Athena open-source alternative

Deploying your own PrestoDB cluster

An AWS Athena alternative is deploying your own PrestoDB cluster. AWS Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).

Deploying and managing PrestoDB on your own means you won't have AWS Athena limitations such as the concurrent query limit, database limits, table limits, partition limits, etc. Plus, you'll get the very latest version of Presto. PrestoDB is an open source project hosted by The Linux Foundation's Presto Foundation. It has a transparent, open, and neutral community.

If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.

Ahana Cloud for Presto: A fully managed service

Ahana Cloud for Presto is a fully managed Presto cloud service without the limits of AWS Athena.

With Ahana Cloud, you can query and analyze AWS data lakes stored in Amazon S3, as well as many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI: anyone at any knowledge level can use it, with zero configuration effort and no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Up next: AWS Athena Query Limits

Related Articles 

Athena vs Presto

Learn the differences between Athena and Presto and understand the pros and cons.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

ahana logo

What is AWS Redshift Spectrum?

Introduction

What is Redshift Spectrum? Since it shares a name with AWS Redshift, there is some confusion about what AWS Redshift Spectrum actually is. To discuss that, however, it's important to know what AWS Redshift is: an Amazon data warehouse product based on PostgreSQL version 8.0.2.

Launched in 2017, Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Spectrum allows you to do federated queries from within the Redshift SQL query editor to data in S3, while also being able to combine it with data in Redshift.

Benefits of Redshift Spectrum

When compared to a similar object-store SQL engine available from Amazon, such as Athena, Redshift Spectrum has significantly higher and more consistent performance. Athena uses pooled resources, while Spectrum is based on your Redshift cluster size and is, therefore, a known quantity.

Spectrum allows you to access your data lake files from within your Redshift data warehouse without having to go through an ingestion process. This makes data management easier, while also reducing data latency since you aren’t waiting for ETL jobs to be written and processed.

With Spectrum, you continue to use SQL to connect to and read AWS S3 object stores in addition to Redshift, which means there are no new tools to learn and you can leverage your existing skill sets. Under the hood, Spectrum breaks user queries into filtered subsets that run concurrently. These can be distributed across thousands of nodes to enhance performance and can scale to query exabytes of data. The data is then sent back to your Redshift cluster for final processing.
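
As a rough sketch of how Spectrum is typically wired up, you register an external schema that points at a Glue data catalog database and then define an external table over files in S3. The schema, table, IAM role, and S3 path below are all hypothetical:

-- Hypothetical external schema backed by the Glue data catalog
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Hypothetical external table over Parquet files in S3
CREATE EXTERNAL TABLE spectrum_sales.orders (
  order_id  BIGINT,
  item_sku  VARCHAR(64),
  quantity  INT,
  order_ts  TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://example-bucket/orders/';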

Performance & Price

Redshift Spectrum is going to be as fast as the slowest data store in your aggregated query. If you are joining from Redshift to a terabyte-sized CSV file, the performance will be extremely slow. Connecting to a well-partitioned collection of column-based Parquet stores on the other hand will be much faster. Not having indexes on the object stores means that you really have to rely on the efficient organization of the files to get higher performance.

As to price, Spectrum follows the terabyte scan model that Amazon uses for a number of its products. You are billed per terabyte of data scanned, rounded up to the next megabyte, with a 10 MB minimum per query. For example, if you scan 10 GB of data, you will be charged $0.05. If you scan 1 TB of data, you will be charged $5.00. This does not include any fees for the Redshift cluster or the S3 storage.

Redshift and Redshift Spectrum Use Case

An example of combining Redshift and Redshift Spectrum could be a high-velocity eCommerce site that sells apparel. Your historical order history is contained in your Redshift data warehouse, but real-time orders are coming in through a Kafka stream and landing in S3 in Parquet format. Your organization needs to make an order decision for particular items because there is a long lead time. Redshift knows what you have done historically, but that S3 data is only processed monthly into Redshift. With Spectrum, the query can combine what is in Redshift and join that with the Parquet files on S3 to get an up-to-the-minute view of order volume so a more informed decision can be made.
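
A query for that scenario might look like the sketch below, joining order history stored in Redshift with the hypothetical spectrum_sales.orders external table defined earlier; table and column names are illustrative:

-- Combine warehouse history with up-to-the-minute orders landing in S3 (illustrative names)
SELECT sku, SUM(quantity) AS total_units
FROM (
  SELECT item_sku AS sku, quantity FROM sales.order_history     -- historical orders in Redshift
  UNION ALL
  SELECT item_sku AS sku, quantity FROM spectrum_sales.orders   -- fresh orders in S3 via Spectrum
) AS combined
GROUP BY sku
ORDER BY total_units DESC;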

Summary

Amazon Redshift Spectrum provides a layer of functionality to Redshift that allows you to interact with object stores in AWS S3 without building a whole other tech stack. It makes sense for companies who are using Redshift and need to stay there but also need to make use of the data lake, or for companies that are considering leaving Redshift behind and going entirely to the data lake. Redshift Spectrum does not make sense if all of your files are already in the data lake: Spectrum becomes very expensive as the data grows and offers little visibility into the queries. This is where a managed service like Ahana for Presto fits in.

Lake-Formation-architecture

How to Build a Data Lake Using Lake Formation on AWS

Introduction

Briefly, AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time – days instead of the months that are common with a traditional data lake approach. AWS Lake Formation builds on and works with the capabilities found in AWS Glue.

How it Works

Your root user can't be the administrator for your data lake, so the first thing to do is create a new user with full admin rights: go to IAM, create that user, and give them the AdministratorAccess policy. Next, create an S3 bucket and any data directories you are going to use, in the S3 section of AWS as you would normally; if you already have an S3 location set up, you can skip that step. Either way, you then need to register that data lake location in Lake Formation. The Lake Formation menu looks like this:

Menu - AWS Lake Formation

Now, with your data sources registered in Lake Formation, you can create a database from those sources and, from there, create your Glue Crawlers. The crawler takes the database you created, goes into the S3 bucket, and reads the directory structure and files to create the tables and fields within the database. Once you've run your crawler, you'll see the tables and fields reflected under "Tables". The crawler creates a metadata catalog that describes the underlying data, which is then presented to other tools for access, such as AWS QuickSight and Ahana Presto. Amazon provides this diagram:

AWS Lake Formation architecture

To summarize thus far, we’ve 

  • Created an admin user
  • Created an S3 bucket
  • Created three directories in the S3 bucket
  • Registered the S3 bucket as a data lake location

Benefits

Having your data repositories registered and then created as a database in Lake Formation provides a number of advantages in terms of centralization of work. Fundamentally, the role of Lake Formation is to control access to data that you register. A combination of IAM roles and “Data lake permissions” is how you control this on a more macro level. Amazon shows the flow this way:

data lake permissions flow

Where the major advantages lie, however, is with "LF-Tags" and "LF-Tag permissions". This is where granular security can be applied in a way that greatly simplifies your life. With Lake Formation there are two ways to assign and manage permissions to catalog resources: named-based access and tag-based access.

manage permissions in catalog

Named-based access is what most people are familiar with. You select the principal, which can be an AWS user or group of users, and assign it access to a specific database or table. The tag-based access control method uses Lake Formation tags, called "LF-Tags". These are attributes assigned to data catalog resources – databases, tables, and even columns – and granted to principals in your AWS account to manage authorization to those resources. This is especially helpful in environments that are growing and/or changing rapidly, where policy management can be onerous. Tags are essentially key/value pairs that define these permissions:

  • Tags can be up to 128 characters long
  • Values can be up to 256 characters long
  • Up to 15 values per tag
  • Up to 50 LF-Tags per resource

Use Cases

If we wanted to control access to an employee table for example, such that HR could see everything, everyone in the company could see the names, titles, and departments of employees, and the outside world could only see job titles, we could set that up as:

  • Key = Employees
  • Values = HR, corp, public

Using this simplified view as an example:

lake formation use case flow_example

We have resources “employees” and “sales”, each with multiple tables, with multiple named rows. In a conventional security model, you would give the HR group full access to the employees resource, but all of the corp group would only have access to the “details” table. What if you needed to give access to position.title and payroll.date to the corp group? We would simply add the corp group LF Tag to those fields in addition to the details table, and now they can read those specific fields out of the other two tables, in addition to everything they can read in the details table. The corp group LF Tag permissions would look like this:

  • employees.details
  • employees.position.title
  • employees.payroll.date

If we were to control by named resources, it would require that each named person would have to be specifically allocated access to those databases and tables, and often there is no ability to control by column, so that part wouldn’t even be possible at a data level.

Summary

AWS Lake Formation really simplifies the process whereby you set up and manage your data lake infrastructure. Where it really shines is in the granular security that can be applied through the use of LF Tags. An AWS Lake Formation tutorial that really gets into the nitty-gritty can be found online from AWS or any number of third parties on YouTube. The open-source data lake has many advantages over a data warehouse and Lake Formation can help establish best practices and simplify getting started dramatically.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Reporting and dashboarding diagram

The Open Data Lakehouse – a quick intro

Data warehouses have been considered a standard to perform analytics on structured data but cannot handle unstructured data such as text, images, audio, video and other formats. Additionally, machine learning and AI are becoming common in every aspect of business and they need access to vast amounts of data outside of data warehouses.

The cloud transformation has triggered the disaggregation of compute and storage, which lowers costs and makes it possible to store data arriving from many different sources and in many different shapes. All of this has led to a new data platform architecture called the Open Data Lakehouse. It solves the challenges of the traditional cloud data warehouse through its use of open source and open format technologies such as Presto and Hudi. In this blog you will learn more about the open data lake analytics stack using Presto, Hudi, and AWS S3.

What is an Open Data Lakehouse

The Open Data Lakehouse is based on the concept of bringing your warehouse workloads to the data lake. You can run analytics on technology and tools that do not require any vendor lock-in – in licensing, data formats, interfaces, or infrastructure.

Four key elements include:

Open source – The technologies on the stack we will be exploring for Open Data lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community. 

Open formats – The stack doesn't use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet, and others.

Open interfaces – The interfaces are industry standard ANSI SQL compatible and standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, industry standard language clauses continue to be added in and expanded on. 

Open cloud – The stack is cloud agnostic, aligns natively with containers, and can run on any cloud.

Why Open Data Lakehouse

Open data lakehouses allow the consolidation of structured and unstructured data in a central repository at lower cost and remove the complexity of running ETL, resulting in high performance while reducing the cost and time to run analytics.

  • Bringing compute to your data (decoupling of compute and storage)
  • Flexibility at the governance/transaction layer
  • Flexibility and low cost to store structured and semi/unstructured data
  • Flexibility at every layer – pick and choose which technology works best for your workloads/use case

Open Data Lakehouse architecture

Now let’s dive into the stack itself and each of the layers. We’ll discuss what problems each layer solves for.


BI/Application tools – Data Visualization, Data Science tools

Plug in the BI/analytical application tool of your choice. The Open Data Lake Analytics stack supports JDBC/ODBC drivers, so you can connect Tableau, Looker, Preset, Jupyter notebooks, etc. based on your use case and workload.

Presto – SQL Query Engine for the Data Lake

Presto is a parallel distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores. 
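
For example, a single Presto query can join click data in an S3-backed Hive table with a customer table in an operational database. The catalog, schema, and table names below are hypothetical, just to sketch what a federated query looks like:

-- Join events in the data lake (hive catalog over S3) with customers in MySQL (hypothetical names)
SELECT c.customer_name, COUNT(*) AS clicks
FROM hive.web.click_events e
JOIN mysql.crm.customers c
  ON e.customer_id = c.customer_id
WHERE e.event_date >= DATE '2022-01-01'
GROUP BY c.customer_name
ORDER BY clicks DESC
LIMIT 10;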

Presto is built for high-performance interactive querying with in-memory execution.

Key characteristics include: 

  • High scalability from 1 to 1000s of workers
  • Flexibility to support a wide range of SQL use cases
  • Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
  • Federation of data sources particularly data lakes via Presto connectors
  • Seamless integration with existing SQL systems through the ANSI SQL standard

Deploying Presto clusters

A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes, and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into a data storage tier like S3.
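
You can see this plan yourself: Presto can print the distributed execution plan it generates before running a query, which shows how the coordinator breaks the work into fragments that are shipped to workers. The table name below is hypothetical:

-- Show the distributed plan (fragments) the coordinator hands to the workers (hypothetical table)
EXPLAIN (TYPE DISTRIBUTED)
SELECT event_type, COUNT(*) AS events
FROM hive.web.click_events
GROUP BY event_type;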

Apache Hudi – Streaming Transactions in the Open Data Lake

One of the big drawbacks of traditional data warehouses is keeping the data updated. It requires building data marts/cubes and then running constant ETL from source to destination mart, resulting in additional time, cost, and duplication of data. Similarly, data in the data lake needs to be kept updated and consistent, but without that operational overhead.

A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves for the following: 

  • Ingesting incremental data
  • Changing data capture, both insert and deletion
  • Incremental data processing
  • ACID transactions

Apache Hudi – which stands for Hadoop Upserts Deletes and Incrementals – is an open source transactional layer with storage abstraction for analytics, originally developed by Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses the open file formats Parquet and Avro for data storage, and internal table formats known as Copy-On-Write and Merge-On-Read.

It has built-in integration with Presto, so you can query Hudi datasets stored in those open file formats.

Hudi Data Management

Hudi has a table format based on a directory structure: a table has partitions, which are folders containing the data files for that partition. It also has indexing capabilities to support fast upserts. Hudi offers two table types that define how data is indexed and laid out, which in turn determines how the underlying data is exposed to queries.

Hudi data management

(Image source: Apache Hudi)

  • Copy-On-Write (COW): Data is stored in the Parquet file format (columnar storage), and each new update creates a new version of the files during a write. Updating an existing set of rows results in a rewrite of the entire Parquet files containing the rows being updated.
  • Merge-On-Read (MOR): Data is stored in a combination of Parquet file format (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which will produce new versions of the columnar files.

Based on the two table types Hudi provides three logical views for querying data from the Data Lake.

  • Read-optimized – Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
  • Incremental – Queries see new data written to the table after a commit/compaction. This helps to build incremental data pipelines and the analytics on top of them.
  • Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline

AWS S3 – The Data Lake

The data lake is the central location for storing data from disparate sources such as structured, semi-structured and unstructured data and in open formats on object storage such as AWS S3.

Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.

Getting Started:

How to run Open data lake analytics workloads using Presto to query Apache Hudi datasets on S3

Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can actually use Presto to query your Hudi datasets on S3.

Ingest your data into AWS S3 and query with Presto

Data can be ingested into the data lake from different sources such as Kafka and other databases. By introducing Hudi into the data pipeline, the needed Hudi tables are created and updated, and the data is stored in either Parquet or Avro format (depending on the table type) in the S3 data lake. BI tools and applications can then query the data using Presto, and results reflect updates as the data changes. A sketch of such a query is shown below.
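
Assuming a hypothetical Merge-On-Read Hudi table named orders that has been synced to the Hive metastore (for MoR tables Hudi typically exposes an orders_ro table for the read-optimized view and an orders_rt table for the real-time view), querying it from Presto is plain SQL; all names here are illustrative:

-- Read-optimized view: latest compacted data, fastest scans (illustrative names)
SELECT order_status, COUNT(*) AS orders
FROM hive.lakehouse.orders_ro
GROUP BY order_status;

-- Real-time view: merges the row-based delta files in at query time for the freshest results
SELECT order_status, COUNT(*) AS orders
FROM hive.lakehouse.orders_rt
GROUP BY order_status;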

Conclusion:

The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance and cost.

The technologies that make up that stack are critical. Presto, the de facto SQL query engine for the data lake, together with the transactional support and change data capture capabilities of Hudi, makes for a strong open source and open format solution for data lake analytics. One missing component is data lake governance, which allows queries on S3 to run more securely. AWS has recently introduced Lake Formation, a data governance solution for the data lake, and Ahana, a managed service for Presto, seamlessly integrates Presto with AWS Lake Formation to run interactive queries on your AWS S3 data lakes with fine-grained access to data.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time – days instead of the months that are common with a traditional data lake approach.

ahana logo

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on demand, and it is compatible with a multitude of AWS tools and technologies. Redshift is a popular cloud data warehouse choice, but its pricing is not simple, since it tries to accommodate many different use cases and customers. Let's try to understand the pricing details of Amazon Redshift.

Understanding node types pricing

A Redshift cluster consists of multiple nodes, allowing it to process data faster. This means Redshift performance depends on the node type and the number of nodes. Node types can be Dense Compute nodes or RA3 nodes with Redshift managed storage.

Dense Compute nodes: These nodes offer physical memory up to 244GB and storage capacity on SSD up to 2.5TB.

RA3 with managed storage nodes: These nodes have physical memory up to 384GB and storage capacity on SSD up to 128TB. Additionally, when storage runs out on the nodes, Redshift will offload the data into S3. The RA3 pricing below does not include the cost of managed storage.

Understanding node types pricing

Redshift spectrum pricing

Redshift Spectrum is a serverless offering that allows running SQL queries directly against an AWS S3 data lake; it is priced per TB of data scanned.

redshift spectrum pricing example

Concurrency scaling

Amazon Redshift allows you to grab additional resources as needed and release them when they are no longer needed. For every day of typical usage, up to one hour of concurrency scaling is free; every second beyond that is charged for the additional resource usage.

A pricing example as stated by Amazon Redshift: a 10-node DC2.8XL Redshift cluster in US-East costs $48 per hour. Consider a scenario where two transient clusters are utilized for 5 minutes beyond the free concurrency scaling credits. The per-second on-demand rate is $48 × 1/3600 = $0.013 per second. The additional cost for concurrency scaling in this case is $0.013 per second × 300 seconds × 2 transient clusters = $8.

Amazon Redshift managed storage (RMS) pricing

Managed storage comes with RA3 node types. Usage of managed storage is calculated hourly based on the total GB stored. Managed storage does not include backup storage charges due to automated and manual snapshots.

Amazon Redshift managed storage pricing(RMS)

Pricing example for managed storage pricing

100 GB stored for 15 days: 100 GB × 15 days × (24 hours/day) = 36,000 GB-hours

100 TB stored for 15 days: 100 TB × 1,024 GB/TB × 15 days × (24 hours/day) = 36,864,000 GB-hours

Total usage in GB-hours: 36,000 GB-hours + 36,864,000 GB-hours = 36,900,000 GB-hours

Total usage in GB-months: 36,900,000 / 720 hours per month = 51,250 GB-months

Total charges for the month: 51,250 GB-months × $0.024 = $1,230

Limitation of Redshift Pricing

As you can see, Redshift offers only a few instance types, each with limited storage.

Customers can easily hit the ceiling on node storage, and Redshift managed storage becomes expensive as data grows.

Redshift Spectrum (the serverless option), at $5 per TB scanned, can be an expensive option, and it removes the customer's ability to scale nodes up and down to meet their performance requirements.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced use cases in Redshift, or managing increasing cloud storage costs, check out Ahana – a powerful managed service for Presto which provides SQL on S3. Unlike Redshift Spectrum, Ahana allows customers to choose the right instance type and scale up or down as needed, and it comes with a simple pricing model based on the number of compute instances.

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.

On-Demand-lake-to-shining-lakehouse

On-Demand Presentation

FROM LAKE TO SHINING LAKEHOUSE, A NEW ERA IN DATA

From data warehousing to data lakes, and now with so-called Data Lakehouses, we're seeing an ever greater appreciation for the importance of architecture. The success of Snowflake proved that data warehousing is alive and well; but that's not to say that data lakes aren't viable. The key is to find a balance of both worlds, thus enabling the rapid analysis afforded by warehousing and the strategic agility of the explorative ad hoc queries that data lakes provide. During this episode of DM Radio you will learn from experts Raj K of General Dynamics Information Technology and Wen Phan of Ahana.


Speakers

K Raj

Wen Phan

Eric Kavanaugh

Presto vs Snowflake: Data Warehousing Comparisons

ahana logo

Snowflake vs Presto

This article touches on several basic elements to compare Presto and Snowflake.

To start, let's define what each of these is. Presto is an open-source SQL query engine for data lakehouse analytics, well known for ad hoc analytics on your data. One important thing to note is that Presto is not a database: you can't store data in Presto; rather, you use it as a compute engine for your data lakehouse. You can use Presto not just in the public cloud but also on private cloud infrastructures (on-premises or hosted).

Snowflake is a cloud data warehouse that offers a cloud-based data storage and analytics service. Snowflake runs completely on cloud infrastructure: it uses virtual compute instances for its compute needs and a storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).

Use cases: Snowflake vs. Presto

Snowflake is a cloud solution for your traditional data warehouse workloads such as reporting and dashboards. It is good for small-scale workloads – moving traditional batch-based reporting and dashboard-based analytics to the cloud. This limitation is discussed further in the Scalability and Concurrency section below.

Presto is not only a solution for reporting & dashboarding. With its connectors and their in-place execution, platform teams can quickly provide access to datasets that analysts have an interest in. Presto can also run queries in seconds. You can aggregate terabytes of data across multiple data sources and run efficient ETL queries. With Presto, users can query data across many different data sources including databases, data lakes, and data lakehouses.
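
As a sketch of what such an ETL-style query can look like in Presto, the following statement materializes a federated join back into the data lake as Parquet; the catalog, schema, and table names are hypothetical:

-- Materialize a federated join into a Parquet table on the data lake (hypothetical names)
CREATE TABLE hive.curated.customer_clicks
WITH (format = 'PARQUET')
AS
SELECT c.customer_id, c.customer_name, COUNT(*) AS clicks
FROM hive.web.click_events e
JOIN mysql.crm.customers c
  ON e.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;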

Open Source Or Vendor lock-in

Snowflake is not Open Source Software. Data that has been aggregated and moved into Snowflake is in a proprietary format only available to Snowflake users. Surrendering all your data to the Snowflake data cloud model is the ideal recipe for vendor lock-in. 

Vendor Lock-In can lead to:

  • Excessive cost as you grow your data warehouse
  • Once ingested, data is typically locked into the formats of a closed source system
  • No community innovations or ways to leverage other innovative technologies and services to process that same data

Presto is an Open Source project, under the Apache 2.0 license, hosted by the Linux Foundation. Presto benefits from community innovation. An open-source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Bytedance, Ahana, and many more. Dedicated Ahana engineers are working on the new PrestoDB C++ execution engine aiming to bring high-performance data analytics to the Presto ecosystem. 

Open File Formats

Snowflake has chosen to use a micro-partition file format that is good for performance but closed source. The Snowflake engine cannot work directly with common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into an internal Snowflake file format, but users then miss out on the performance optimizations these open formats can bring to an engine – including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes – as well as file-layout best practices such as avoiding many small files or a few huge files.

On the other hand, Presto users can run ad hoc, real-time analytics – as well as deep learning – on those same source files without needing to copy them, so there's more flexibility with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.

Open transaction format

Many organizations are adopting the data lakehouse architecture and augmenting their current data warehouse. This brings the need for a transaction manager layer, which can be supplied by Apache Hudi, Apache Iceberg, or Delta Lake. Snowflake does not support all of these table formats. Presto supports all of these table formats natively, allowing users more flexibility and choice. With ACID transaction support from these table formats, Presto is the SQL engine for the Open Data Lakehouse. Moreover, the Snowflake data warehouse doesn't support semi-structured/unstructured data workloads and AI/ML/data science workloads the way the data lakehouse does.

Data Ownership

While Snowflake did decouple storage and compute, it did not decouple data ownership. Snowflake still owns the compute layer as well as the storage layer. This means users must ingest data into Snowflake using a proprietary format, creating yet another copy of the data and also requiring users to move their data out of their own environment. Users lose ownership of their data.

On the other hand, Presto is a truly disaggregated stack that allows you to run your queries in a federated manner without any need to move your data and create multiple copies. With Ahana, users can define Presto clusters and orchestrate and manage them in their own AWS account using cross-account roles.

Scalability and Concurrency

With Snowflake, you hit a limit on the number of concurrent queries a single virtual warehouse can run: if you have more than eight concurrent users, you need to spin up another virtual warehouse. Query performance is good for simple queries; however, performance degrades as you apply more complex joins on large datasets, and the only options available are limiting the data that you query with Snowflake or adding more compute. Parallel writes also impact read operations, and the recommendation is to use separate virtual warehouses.

Presto is designed from the ground up for fast analytic queries against data sets of any size. It has been proven on petabytes of data and supports 10s–50s of concurrent queries at a time.

Cost of Snowflake

Users think of Snowflake as an easy, low-cost model. However, it can get very expensive: ingesting data into Snowflake is cost-prohibitive at scale, and very large amounts of data and enterprise-grade, long-running queries require adding more virtual data warehouses, which can rapidly escalate costs. Basic performance-improvement features like materialized views come at additional cost. Because Snowflake is not fully decoupled, data is copied and stored in Snowflake's managed cloud storage layer within Snowflake's account; as a result, users end up paying Snowflake more than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price with a proprietary tag.

Open Source Presto is completely free. Users can run on-prem or in a cloud environment. Presto allows you to leave your data in the lowest cost storage options. You can create a portable query abstraction layer to future-proof your data architecture. Costs are for infrastructure, with no hidden cost for premium features. Data federation with Presto allows users to shrink the size of their data warehouse. By accessing the data where it is, users may cut the expenses of ETL development and maintenance associated with data transfer into a data warehouse. With Presto, you can also leverage storage savings by storing “cold” data in low-cost options like a data lake and “hot” data in a typical relational or non-relational database. 

Snowflake vs. Presto: In Summary

Snowflake is a well-known cloud data warehouse, but sometimes users need more than that – 

  1. Immediate data access as soon as it is written, in a federated manner
  2. Elimination of the lag associated with ETL migration, since you can query directly from the source
  3. A flexible environment to run unstructured/semi-structured or machine learning workloads
  4. Support for open file formats and storage standards to build an open data lakehouse
  5. Open source technologies to avoid vendor lock-in
  6. A cost-effective solution that is optimized for high concurrency and scalability

Presto can solve all these user needs in a more flexible, open source, secure, scalable, and cost-effective way.

SaaS for Presto

If you want to use Presto, we've made it easy to get started in AWS. Ahana is a SaaS for Presto. With Ahana for Presto, you run in containers on Amazon EKS, making the service highly scalable and available. Ahana optimizes Presto clusters and scales compute up and down as necessary, which helps companies achieve cost control. With Ahana Cloud, you can easily integrate Presto with Apache Ranger or AWS Lake Formation and address your fine-grained access control needs. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and the data access and security policies you want to apply.

Redshift-internal-architecture-

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS, and a popular choice for business intelligence and reporting use cases (see What Is Redshift Used For?).

You might already be familiar with Redshift basics – but in this article, we’ll dive a bit deeper to cover Redshift’s internal system design and how it fits into broader data lake and data warehouse architectures. Understanding these factors will help you reap the most benefits from your deployment while controlling your costs.

Redshift Architecture and Main Components

Redshift internal architecture

As with other relational databases, storage and compute in Redshift are coupled. Data from applications, files, and cloud storage can be loaded into the data warehouse using native AWS services such as Amazon AppFlow or through a variety of third-party tools such as Fivetran and Matillion. Many of these tools also provide ELT capabilities to further cleanse, transform, and aggregate data after it has been loaded into Redshift.

Zooming in on the internal architecture, we can see that a Redshift cluster is composed of a leader node and compute nodes, which are divided into node slices, and databases. This design allows Redshift to dynamically allocate resources in order to efficiently answer queries.

Breaking Down the Redshift Cluster Components

  • The leader node is Redshift’s ‘brain’ and manages communications with external client programs as well as the internal communication between compute nodes. When a query is made, the leader node will parse it, compile the code and create an execution plan.
  • Compute nodes provide the ‘muscle’ – the physical resources required to perform the requested database operation. This is also where the data is actually stored. Each compute node has dedicated CPU, RAM and storage, and these differ according to the node type.
  • The execution plan distributes the workload between compute nodes, which process the data in parallel. The workload is further distributed within the node: each node is partitioned into node slices, and each node slice is allocated a portion of the compute node’s memory and disk, according to the amount of data it needs to crunch.
  • Intermediate results are sent back to the leader node, which performs the final aggregation and sends the results to client applications via ODBC or JDBC. These would frequently be reporting and visualization tools such as Tableau or Amazon Quicksight, or internal software applications that read data from Redshift.
  • Redshift’s Internal Network provides high-speed communication between the nodes within the cluster.
  • Each Redshift cluster can contain multiple databases, with resources dynamically allocated between them.

This AWS presentation offers more details about Redshift’s internal architecture, and a step-by-step breakdown of how queries are handled in Redshift and Redshift Spectrum:

Additional Performance Features

In addition to these core components, Redshift has multiple built-in features meant to improve performance:

  • Columnar storage: Redshift stores data in a column-oriented format rather than the row-based storage of traditional OLTP databases. This allows for more efficient compression and indexing.
  • Concurrency scaling: When a cluster receives a large number of requests, Redshift can automatically add resources to maintain consistent performance in read and write operations. 
  • Massively Parallel Processing (MPP): As described above, multiple compute nodes work on portions of the same query at the same time, ensuring final aggregations are returned faster.
  • Query optimizer: Redshift applies query optimizations that leverage its MPP capabilities and columnar data storage. This helps Redshift process complex SQL queries that can include multi-table joins and subqueries (see the sketch after this list).
  • Result caching: The results of certain types of queries can be stored in memory on the leader node, which can further reduce query execution time.
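
As a small illustration of the optimizer at work, Redshift's EXPLAIN command prints the plan it generates for a query before any data is read; the tables and columns below are hypothetical:

-- Inspect the execution plan Redshift's optimizer produces (illustrative table names)
EXPLAIN
SELECT s.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
GROUP BY s.region;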

Redshift vs Traditional Data Warehouses

While Redshift can replace many of the functions filled by ‘traditional’ data warehouses such as Oracle and Teradata, there are a few key differences to keep in mind:

  • Managed infrastructure: Redshift infrastructure is fully managed by AWS rather than its end users – including hardware provisioning, software patching, setup, configuration, monitoring nodes and drives, and backups.
  • Optimized for analytics: While Redshift is a relational database management system (RDBMS) based on PostgreSQL and supports standard SQL, it is optimized for analytics and reporting rather than transactional features that require very fast retrieval or updates of specific records.
  • Serverless capabilities: A more recent addition, Redshift Serverless can be used to automatically provision compute resources after a specific SQL query is made, further abstracting infrastructure management by removing the need to size your cluster in advance.

Redshift Costs and Performance

Amazon Redshift pricing can get complicated and depends on many factors, so a full breakdown is beyond the scope of this article. There are three basic types of pricing models for Redshift usage:

  • On-demand instances are charged by the hour, with no long-term commitment or upfront fees. 
  • Reserved instances offer a discount for customers who are willing to commit to using Redshift for a longer period of time. 
  • Serverless instances are charged based on usage, so customers only pay for the capacity they consume.

The size of your dataset and the level of performance you need from Redshift will often dictate your costs. Unlike object stores such as Amazon S3, scaling storage is non-trivial from a cost perspective (due to Redshift’s coupled architecture). When implementing use cases that require granular historical datasets you might find yourself paying for very large clusters. 

Performance depends on the number of nodes in the cluster and the type of node – you can pay for more resources to guarantee better performance. Other pertinent factors are the distribution of data, the sort order of data, and the structure of the query. 

Finally, you should bear in mind that Redshift compiles code the first time a query is run, meaning queries might run faster from the second time onwards – making it more cost-effective for situations where the queries are more predictable (such as a BI dashboard that updates every day) rather than exploratory ad-hoc analysis.

Reducing Redshift Costs with a Lakehouse Architecture

We’ve worked with many companies who started out using Redshift when they didn’t have much data but found it difficult and costly to scale as their needs evolved. 

Companies can face rapid growth in data when they acquire more users, introduce new business systems, or simply want to perform deeper exploratory analysis that requires more granular datasets and longer data retention periods. With Redshift’s coupling of storage and compute, this can cause their costs to scale almost linearly with the size of their data.

At this stage, it makes sense to consider moving from a data warehouse architecture to a data lakehouse to leverage inexpensive storage on Amazon S3 while distributing ETL and SQL query workloads between multiple services.

Redshift Lakehouse Architecture

In this architecture, companies can continue to use Redshift for workloads that require consistent performance such as dashboard reporting, while leveraging best-in-class frameworks such as open-source Presto to run queries directly against Amazon S3. This allows organizations to analyze much more data – without having to constantly up or downsize their Redshift clusters, manage complex retention policies, or deal with unmanageable costs.

To learn more about what considerations you should be thinking about as you look at data warehouses or data lakes, check out this white paper by Ventana Research: Unlocking the Value of the Data Lake.

data-warehouse-or-data-lake-on-demand

Webinar On-Demand
Data Warehouse or Data Lake, which one do I choose?

(Hosted by Dataversity)

Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). There are pros and cons to each approach. While the data warehouse will give you strong data management and analytics, it doesn’t do well with semi-structured and unstructured data, tightly couples storage and compute, and carries expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.


Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.


In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.

Speaker

Ali LeClerc

Head of Community, Ahana

presto-query-analyzer-logo

Ahana Announces New Presto Query Analyzer to Bring Instant Insights into Presto Clusters

Free-to-use Presto Query Analyzer by Ahana enables data platform teams to analyze Presto workloads and ensure top performance

San Mateo, Calif. – May 18, 2022 – Ahana, the only SaaS for Presto, today announced a new tool for Presto users called the Presto Query Analyzer. With the Presto Query Analyzer, data platform teams can get instant insights into their Presto clusters including query performance, bandwidth bottlenecks, and much more. The Presto Query Analyzer was built for the Presto community and is free to use.

Presto has become the SQL query engine of choice for the open data lakehouse. The open data lakehouse brings the reliability and performance of the data warehouse together with the flexibility and simplicity of the data lake, enabling data warehouse workloads to run alongside machine learning workloads. Presto on the open data lakehouse enables much better price performance as compared to expensive data warehousing solutions. As more companies are moving to an open data lakehouse approach with Presto as its engine, having more insights into query performance, workloads, resource consumption, and much more is critical.

“We built the Presto Query Analyzer to help data platform teams get deeper insights into their Presto clusters, and we are thrilled to be making this tool freely available to the broader Presto community,” said Steven Mih, Cofounder & CEO, Ahana. “As we see the growth and adoption of Presto continue to skyrocket, our mission is to help Presto users get started and be successful with the open source project. The Presto Query Analyzer will help teams get even more out of their Presto usage, and we look forward to doing even more for the community in the upcoming months.”

Key benefits of the Presto Query Analyzer include:

  • Understand query workloads: Break down queries by operators, CPU time, memory consumption, and bandwidth. Easily cross-reference queries for deep drill down.
  • Identify popular data: See which catalogs, schemas, tables, and columns are most and least frequently used, and by whom.
  • Monitor resource consumption: Track CPU and memory utilization across the users in a cluster.

The Presto Query Analyzer by Ahana is free to download and use. Download it to get started today.

More Resources

Free Download: Presto Query Analyzer by Ahana

Presto Query Analyzer sample report

Tweet this:  @AhanaIO announces #free #Presto Query Analyzer for instant insights into your Presto clusters https://bit.ly/3lo2rMM

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

ahana logo

What is Amazon Redshift Used For?

Introduction

Amazon Redshift is one of the most widely-used services in the AWS ecosystem and is a familiar component in many cloud architectures. In this article, we’ll cover the key facts you need to know about this cloud data warehouse and the use cases it is best suited for, as well as limitations and scenarios where you might want to consider alternatives.

What is Amazon Redshift?

Amazon Redshift is a fully managed cloud data warehouse offered by AWS. First introduced in 2012, today Redshift is used by thousands of customers, typically for workloads ranging from hundreds of gigabytes to petabytes of data.

Redshift is based on PostgreSQL 8.0.2 and supports standard SQL for database operations. Under the hood, various optimizations are implemented to provide fast performance even at larger data scales, including massively parallel processing (MPP) and read-optimized columnar storage.

What is a Redshift Cluster?

A Redshift cluster represents a group of nodes provisioned as resources for a specific data warehouse. Each cluster consists of a leader and compute nodes. When a query is executed, Redshift’s MPP design means it distributes the processing power needed to return the results of an SQL query between the available nodes automatically.

Determining cluster size depends on the amount of data stored in your database, the number of queries being executed, and the desired performance. 

Scaling and managing clusters can be done through the Redshift console, the AWS CLI, or programmatically through the Redshift Query API.

What Makes Redshift Unique?

When Redshift was first launched, it represented a true paradigm shift from traditional data warehouses provided by the likes of Oracle and Teradata. As a fully managed service, Redshift allowed development teams to shift their focus away from infrastructure and toward core application development. The ability to add compute resources automatically with just a few clicks or lines of code, rather than having to set up and configure hardware, was revolutionary and allowed for much faster application development cycles.

Today, many modern cloud data warehouses offer similar linear scaling and infrastructure-as-a-service functionality – with a few notable products including Snowflake and Google BigQuery. However, Redshift remains a very popular choice and is tightly integrated with other services in the AWS cloud ecosystem.

Amazon continues to improve Redshift, and in recent years has introduced federated query capabilities, serverless, and AQUA (hardware accelerated cache).

Redshift Use Cases

Redshift’s Postgres roots mean it is optimized for online analytical processing (OLAP) and business intelligence (BI) – typically executing complex SQL queries on large volumes of data rather than transactional processing which focuses on efficiently retrieving and manipulating a single row.

Some common use cases for Redshift include:

  • Enterprise data warehouse: Even smaller organizations often work with data from multiple sources such as advertising, CRM, and customer support. Redshift can be used as a centralized repository that stores data from different sources in a unified schema and structure to create a single source of truth, which can then feed enterprise-wide reporting and analytics.
  • BI and analytics: Redshift’s fast query execution against terabyte-scale data makes it an excellent choice for business intelligence use cases, and it is often used as the underlying database for BI tools such as Tableau (which otherwise might struggle to perform when querying or joining larger datasets).
  • Embedded analytics and analytics as a service: Some organizations might choose to monetize the data they collect by exposing it to customers. Redshift’s data sharing, search and aggregation capabilities make it viable for these scenarios, as it allows exposing only relevant subsets of data per customer while ensuring other databases, tables, or rows remain secure and private.
  • Production workloads: Redshift’s performance is consistent and predictable, as long as the cluster is adequately-resourced. This makes it a popular choice for data-driven applications, which might use data for reporting or perform calculations on it.
  • Change data capture and database migration: AWS Database Migration Service (DMS) can be used to replicate changes in an operational data store into Amazon Redshift. This is typically done to provide more flexible analytical capabilities, or when migrating from legacy data warehouses.

Redshift Challenges and Limitations 

While Amazon Redshift is a powerful and versatile data warehouse, it still suffers from the limitations of any relational database, including:

  • Costs: Since storage and compute are coupled, Redshift costs can quickly grow very high when working with larger datasets, or with streaming sources such as application logs.
  • Complex data ingestion: Unlike Amazon S3, Redshift does not support unstructured object storage. Data needs to be stored in tables with predefined schemas, which can often require complex ETL or ELT processes to be performed when data is written to Redshift. 
  • Access to historical data: Due to the above limiting factors, most organizations choose to store only a subset of raw data in Redshift, or limit the number of historical versions of the data that they retain. 
  • Vendor lock-in: Migrating data between relational databases is always a challenge due to the rigid schema and file formats used by each vendor. This can create significant vendor lock-in and make it difficult to use other tools to analyze or access data.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced Redshift use cases, or managing increasing cloud storage costs, check out Ahana – a powerful managed service for Presto which provides SQL on S3. 

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.

ETL process diagram

ETL vs ELT in Data Warehousing

Introduction

ETL, or Extract Transform Load, is when an ETL tool or a series of homegrown programs extracts data from one or more data sources, often relational databases, and performs transformation functions – data cleansing, standardization, enrichment, and so on – before writing (loading) that data into a new repository, often a data warehouse.

In the ETL process, an ETL tool or series of programs extracts the data from different RDBMS source systems, then transforms the data by applying calculations, concatenations, etc., and finally loads the data into the data warehouse system.

ETL process diagram

ELT, or Extract Load Transform, turns the ETL process around a bit: you extract the raw data from the data source and load it directly into the destination, without any processing in between. The transformation is then done “in place” in the destination repository. Generally, the raw data is stored indefinitely so that various transformations and enrichments can be done by any users with access to it, using tools they are familiar with. A minimal sketch of this in-place transformation follows the diagram below.

ELT, or Extract Load Transform

Both are data integration styles and have much in common with their ultimate goals, but are implemented very differently.

The Difference Between ETL and ELT

So how does ETL vs ELT break down?

Category | ETL | ELT
Definition | Data is extracted from 'n' number of data sources, transformed in a separate process, then loaded into the destination repository. | Data is extracted from 'n' number of data sources and directly loaded into the destination repository. Transformation occurs inside the destination.
Transformation | Data is transformed within an intermediate processing step that is independent of extract and load. | Data can be transformed on an ad hoc basis during reads, or in batch and stored in another set of tables.
Code-Based Transformations | Primarily executed in the compute-intensive transformation process. | Primarily executed in the database, but also done ad hoc through analysis tools.
Data Lake Support | Only in the sense that the lake can be utilized as storage for the transformation step. | Well oriented for the data lake.
Cost | Specialized servers for transformation can add significant costs. | Object stores are very inexpensive, requiring no specialized servers.
Maintenance | Additional servers add to the overall maintenance burden. | Fewer systems mean less to maintain.
Loading | Data has to be transformed prior to loading. | Data is loaded directly into the destination system.
Maturity | ETL tools and methods have been around for decades and are well understood. | Relatively new on the scene, with emerging standards and less experience.

Use Cases

Let’s take HIPAA as an example of data that would lend itself to ETL rather than ELT. The raw HIPAA data contains a lot of sensitive information about patients that isn’t allowed to be shared, so you would need to go through the transformation process prior to loading it to remove any of that sensitive information. Say your analysts were trying to track cancer treatments for different types of cancer across a geographic region. You would scrub your data down in the transformation process to include treatment dates, location, cancer type, age, gender, etc., but remove any identifying information about the patient.

An ELT approach makes more sense with a data lake where you have lots of structured, semi-structured, and unstructured data. This can also include high-velocity data where you are trying to make decisions in near real time. Consider an MMORPG where you want to offer incentives to players in a particular region who have performed a particular task. That data is probably coming in through a streaming platform such as Kafka, and analysts run transformation jobs on the fly to distill it down to the information needed to fuel the desired action, as sketched in the query below.
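A minimal, hedged sketch of that kind of transform-on-read query in SQL, assuming a hypothetical raw_game_events table that was loaded into the lake as-is:

-- Shape the raw events at query time instead of transforming them before loading
SELECT player_id,
       region,
       count(*) AS completed_tasks
FROM raw_game_events
WHERE event_type = 'task_completed'
  AND event_time >= current_timestamp - interval '1' hour
GROUP BY player_id, region;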

Summary

In summary, the difference between ETL and ELT in data warehousing really comes down to how you are going to use the data as illustrated above. They satisfy very different use cases and require thoughtful planning and a good understanding of your environment and goals. If you’re exploring whether to use a data warehouse or a data lake, we have some resources that might be helpful. Check out our white paper on Unlocking the Business Value of the Data Lake which discusses the data lake approach in comparison to the data warehouse. 

Understanding AWS Athena Costs with Examples

What Is Amazon Athena? 

Since you're reading this to understand Athena costs, you likely already know what it is, so we'll only touch on it briefly. Amazon Athena is a managed, serverless version of Presto that provides a SQL query engine for analyzing data in AWS S3. Because there are no dedicated resources behind the service, it will not perform in a consistent fashion, so it works best where reliable speed and scalability are not particularly important: testing ideas, small use cases, and quick ad hoc analysis.

How Much Does AWS Athena Cost?

An Athena query costs from $5 to $7 per terabyte scanned, depending on the region. Most materials you read will only quote the $5, but there are regions that cost $7, so keep that in mind. For our examples, we’ll use the $5 per terabyte as our base. There are no costs for failed queries, but any other charges such as the S3 storage will apply as usual for any service you are using.

AWS Athena Pricing Example

In this example, we have a screenshot from the Amazon Athena pricing calculator where we are assuming 1 query per work day per month, so 20 queries a month, that would scan 4TB of data. The cost per query works out as follows:

$5 per TB scanned * 4 TB scanned = $20 per query

So if we are doing that query 20 times per month, then we have 20 * $20 = $400 per month 

Service settings_Amazon Athena

You can mitigate these costs by storing your data compressed, if that is an option for you. A very conservative 2:1 compression ratio would cut your cost in half, to just $200 per month. If you were also to store your data in a columnar format like ORC or Parquet, you could reduce your costs even further by scanning only the columns you need instead of the entire row every time. Using the same 50% assumption, where we now only have to look at half of our data, the cost drops to $100 per month. A sketch of converting raw files into Parquet is shown below.
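To make those savings concrete, here is a minimal, hedged sketch of an Athena CTAS statement that rewrites a CSV-backed table as Parquet so later queries scan fewer bytes; the database, table, column, and bucket names are all hypothetical.

-- Rewrite a hypothetical CSV-backed table as Parquet in a new S3 location
CREATE TABLE analytics.events_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/events-parquet/'
) AS
SELECT event_date, user_id, event_type, payload
FROM analytics.events_csv;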

Let’s go ahead and try a larger example, and not even a crazy big one if you are using the data lake and doing serious processing. Let’s say you have 20 queries per day, and you are working on 100TB of uncompressed, row based data:

pricing calculator

That's right, $304,000 per month. Twenty queries per day isn't even unrealistic if you have a few departments that want to run dashboard queries to get updates on various metrics.

Summary

While we learned the details of Athena pricing, we also saw how easy it would be to get hit with a giant bill unexpectedly. If you have simply dumped a bunch of CSV or JSON files into S3 without compressing or reformatting the data to reduce those costs, you can be in for a nasty surprise. The same goes for opening up Athena to your data consumers without any controls while they fire off a lot of queries on a lot of data. It's not hard to figure out what the cost will be for a specific usage pattern, and Amazon has provided the tools to do it.

If you’re an Athena user who’s not happy with costs, you’re not alone. We see many Athena users wanting more control over their deployment and in turn, costs. That’s where we can help – Ahana is SaaS for Presto (the same technology that Athena is running) that gives you more control over your deployment. Typically our customers see up to 5.5X price performance improvements on their queries as compared to Athena. 

You can learn more about how Ahana compares to AWS Athena in this comparison page.

The next EDW is the Open Data Lakehouse

5 Components of Data Warehouse Architecture

The Data Warehouse has been around for decades. Born in the 1980s, it addressed the need for optimized analytics on data. As companies’ business applications began to grow and generate/store more data, they needed a system that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse (this process is also known as ETL – Extract, Transform, Load). Schema is made up of metadata (data about the data) so users could easily find what they were looking for. The data warehouse could also connect to many different data sources, so it became an easier way to manage all of a company’s data for analysis.

As the data warehouse grew in popularity, more people within a company started using it to access data – and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

Here’s a quick high level architecture of the data warehouse:

Architecture of Data Warehouse
In this article we'll look at the five core components of a data warehouse.

Those include:
  • ETL
  • Metadata
  • SQL Query Processing
  • Data layer
  • Governance/security

ETL 

As mentioned above, ETL stands for Extract, Transform, Load. When DBAs want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed/queried/etc. For the purposes of this article, I won’t go into too much detail of how the entire ETL process works, but there are many different resources where you can learn about ETL.

Metadata

Metadata is data about data. Basically, it describes all of the data that’s stored in a system to make it searchable. Some examples of metadata include authors, dates, or locations of an article, create date of a file, the size of a file, etc. Think of it like the titles of a column in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.

SQL Query Processing

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull insights out of the data stored in the data warehouse. Typically, data warehouses have proprietary SQL query processing technologies tightly coupled with the compute, which allows for very high performance for your analytics. One thing to note, however, is that a data warehouse can start getting expensive as the amount of data and SQL compute resources grows. A sketch of a typical analyst query follows below.
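As a concrete illustration, here is the kind of analytical SQL an analyst might run against warehouse tables; the fact and dimension table and column names are hypothetical.

-- Monthly revenue by region from a hypothetical star schema
SELECT d.region,
       date_trunc('month', f.order_date) AS order_month,
       sum(f.revenue) AS total_revenue
FROM fact_orders f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
GROUP BY d.region, date_trunc('month', f.order_date)
ORDER BY order_month, total_revenue DESC;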

Data Layer

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your Sales team access to your HR team’s data, and vice versa.
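As a minimal sketch of how that partitioning might look in practice: a data mart view exposes only the columns a team should see, and a grant limits who can read it. The view, table, and role names below are hypothetical, and the exact grant syntax varies by warehouse.

-- Expose only sales-related columns to the sales team via a data mart view
CREATE VIEW sales_mart.orders_summary AS
SELECT order_id, order_date, region, product_id, revenue
FROM warehouse.orders;

-- Allow only sales analysts to read the view (syntax varies by warehouse)
GRANT SELECT ON sales_mart.orders_summary TO ROLE sales_analyst;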

Governance/Security

This is related to the data layer in that you need to be able to provide fine grained access and security policies across all of your organization’s data. Typically data warehouses have very good governance and security capabilities built in, so you don’t need to do a lot of custom engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.

The Challenges with a Data Warehouse

Now that I’ve laid out the five key components of a data warehouse, let’s discuss some of the challenges of the data warehouse. As companies start housing more data and needing more advanced analytics and a wide range of data, the data warehouse starts to become expensive and not so flexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work. 

We’re seeing more companies moving to the Data Lakehouse architecture, which helps to address the above. The Data Lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. Instead of a tightly coupled system, the Data Lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data, and more. Here’s what that architecture looks like:

EDW is the open data lakehouse diagram

The Data Lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads.
If you’re interested in learning more about why companies are moving from the data warehouse to the data lakehouse, check out this free whitepaper on how to Unlock the Business Value of the Data Lake/Data Lakehouse.

webinar-introduction-toahana-cloud-on-aws-ondemand

Webinar On-Demand
An introduction to Ahana Cloud for Presto on AWS

The Open Data Lakehouse brings the reliability and performance of the Data Warehouse together with the flexibility and simplicity of the Data Lake, enabling data warehouse workloads to run on the data lake. At the heart of the data lakehouse is Presto, the open source SQL query engine for ad hoc analytics on your data. During this webinar we will share how to build an open data lakehouse with Presto and AWS S3 using Ahana Cloud.

Presto, the fast-growing open source SQL query engine, disaggregates storage and compute and leverages all data within an organization for data-driven decision making. It is driving the rise of Amazon S3-based data lakes and on-demand cloud computing. Ahana is a managed service for Presto that gives data platform teams of all sizes the power of SQL for their data lakehouse.

In this webinar we will cover:

  • What an Open Data Lakehouse is
  • How you can use Presto to underpin the lakehouse in AWS
  • A demo on how to get started building your Open Data Lakehouse in AWS

Speaker

Shawn Gordon

Sr. Developer Advocate, Ahana

Shawn Gordon Headshot

ahana logo

Enterprise Data Lake Formation & Architecture on AWS

What is an Enterprise Data Lake?

An enterprise data lake is simply a data lake used for enterprise-wide sharing and storage of data. Its key purpose is to enable analytics on the stored data to unlock business insights.

Why AWS Lake formation for Enterprise Data Lake

The key purpose of an enterprise data lake is to run analytics to gain business insights. As part of that process, data governance becomes more important in order to secure access to data across the different roles in the enterprise. AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for data lakes on AWS S3.

Enterprise Data Lake Formation & Architecture on AWS

Enterprise data platforms need a simpler, scalable, and centralized way to define and enforce access policies on their data lakes. A policy-based approach lets data lake consumers use the analytics service of their choice, best suited to the operations they want to perform on the data. Managing access control with Amazon S3 bucket policies is an option, but as the number of combinations of access levels and users grows, it may no longer be workable for enterprises.


AWS Lake Formation allows enterprises to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). 

Benefits of AWS Lake formation for Enterprise Data Lakes

  • One schema, shareable with no dependency on architecture
  • Share AWS Lake Formation databases and tables with any AWS account
  • No Amazon S3 policy edits required
  • Recipients of the data can use an analytics service provider like Ahana to run analytics
  • No dependency between roles on how the data will be further shared
  • Centralized logging

AWS Enterprise Lake Formation: To Summarize

AWS Lake Formation has been integrated with AWS partners like Ahana Cloud, a managed service for SQL on data lakes. These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for their data lakes.

Webinar On-Demand
How to build an Open Data Lakehouse Analytics stack

As more companies are leveraging the Data Lake to run their warehouse workloads, we're seeing many of them move to an Open Data Lakehouse stack. The Open Data Lakehouse brings the reliability and performance of the Data Warehouse together with the flexibility and simplicity of the Data Lake, enabling data warehouse workloads to run on the data lake.

Join us for this webinar where we’ll show you how you can build an open data lakehouse stack. At the heart of this stack is Presto, the open source SQL query engine for the data lake, and the transaction manager / governance layer, which includes technologies like Apache Hudi, Delta Lake, and AWS Lake Formation.

You’ll Learn:

  • What an Open Data Lakehouse Analytics Stack is
  • How Presto, the de facto query engine for the data lakehouse, underpins that stack
  • How to get started building your open data lakehouse analytics stack today

Speaker

Shawn Gordon

Sr. Developer Advocate, Ahana

Shawn Gordon Headshot

ahana logo

How to Query Your JSON Data Using Amazon Athena

AWS Athena is Amazon's serverless implementation of Presto, which means the two generally share the same features. A popular use case is using Athena to query Parquet, ORC, CSV, and JSON files, either querying them in place or transforming and loading them into a data warehouse. Athena lets you extract values from JSON data, search it, and parse nested structures.

Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Toyota"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": {
        "make": "Ford"
      },
      "car2": {
        "make": "SAAB"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Porsche"
      },
      "car3": {
        "make": "Kia"
      }
    }
  }
]
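For reference, a table needs to be defined over this data before it can be queried. The following is a minimal, hedged sketch of that DDL, assuming the records are stored as one JSON object per line in a hypothetical S3 location (the JSON SerDe does not read a single multi-line JSON array).

CREATE EXTERNAL TABLE the_table (
  name string,
  age int,
  cars struct<car1:struct<make:string>,
              car2:struct<make:string>,
              car3:struct<make:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/people-json/';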

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table; 
name  | age | cars.car2.make
Sam   | 45  | Toyota
Sally | 21  | SAAB
Bill  | 68  | Porsche

That is a pretty simple use case of retrieving certain fields out of the JSON. The only complexity was the cars column with its key/value pairs, where we needed to identify which field we wanted. Nested values in a JSON can be represented as "key=value", an "array of values", or an "array of key=value" expressions. We'll illustrate the latter two next.

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
	"dealership": "Family Honda",
	"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) as t(cars)
dealership   | cars
Family Honda | Civic
Family Honda | Accord
Family Honda | Odyssey
Family Honda | Brio
Family Honda | Pilot

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:

select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee_unnested)

dealership   | employee_unnested
Family Honda | {name=Allan, dept=service, age=45}
Family Honda | {name=Bill, dept=sales, age=52}
Family Honda | {name=Karen, dept=finance, age=32}
Family Honda | {name=Terry, dept=admin, age=27}

By using the “.key”, we can now retrieve a specific column:

select dealership, employee_unnested.name, employee_unnested.dept, employee_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested)

dealership   | name    | dept    | age
Family Honda | Allan   | service | 45
Family Honda | Bill    | sales   | 52
Family Honda | Karen   | finance | 32
Family Honda | Terry   | admin   | 27

Using these building blocks, you can start to experiment on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs also increase significantly as the scanned data volume grows.

At Ahana, many of our customers are former Athena users who ran into challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 per terabyte scanned, depending on the region. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price performance, and dedicated compute resources.


You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

Webinar On-Demand
Unlocking the Business Value of the Data Lake  

As more companies are moving to the cloud, choosing how and where to store their data remains a critical decision. While the data lake has quickly become a popular choice, the challenge lies in getting business value out of that data lake. To solve for that, we’re seeing a modern, open data lake analytics stack emerge. This stack includes open source, open formats, and open clouds, giving companies flexibility at each layer so they can harness the full potential of their data lake data.

During this webinar we’ll discuss how nearly three-fifths of organizations have gained competitive advantage from their data lake initiatives. That includes unleashing the intelligence-generating potential of a data lake that enables ad hoc data discovery and analytics in an open and flexible manner. We’ll cover:

  • The primary approaches for building an open data lake analytics stack, including where and how the data warehouse fits
  • The business benefits enabled by the technical advantages of this open data lake analytics stack
  • Why structured data processing and analytics accelerating capabilities are critical

Speakers

Matt Aslett

VP & Research Director, Ventana Research

Matt-Aslett-Ventana-1

Wen Phan

Director of Product, Ahana

Wen-Phan_Picture

What is an Open Data Lake in the Cloud?

Data Driven Insights diagram

Problems that necessitate a data lake

In today's competitive landscape, companies are increasingly leveraging their data to make better decisions, provide value to their customers, and improve their operations. Data-driven insights can help business and product leaders home in on customer needs and find untapped opportunities. Analytics dashboards can also be presented to customers for added value. Traditionally, insights were gleaned from rather small amounts of enterprise data, which is what you'd expect: historical information about products, customers, and sales. But the modern business must now deal with thousands of times more data, encompassing many more data types that go far beyond enterprise data. Examples include 3rd party data feeds, IoT sensor data, event data, geospatial data, and other telemetry.

The problem with having 1000s of times the data is that databases, and specifically data warehouses, can be very expensive. And data warehouses are optimized to handle relational data with a well-defined structure and schema. As both data volumes and usage grow, the costs of a data warehouse can easily spiral out of control. Those costs, coupled with the inherent lock-in associated with data warehouses, have left many companies looking for a better solution, either augmenting their enterprise data warehouse or moving away from them altogether. 

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture. 

The Open Data Lake in the cloud centers on S3-based object storage. In AWS, there can be many S3 buckets across an organization; in Google Cloud, the equivalent service is Google Cloud Storage (GCS); and in Microsoft Azure it is Azure Blob Storage. The data lake can store the relational data that typically comes from business apps, just like a data warehouse does, but it also stores non-relational data from a variety of sources as mentioned above. In short, the data lake can store structured, semi-structured, and unstructured data.

With all this data stored in the data lake, companies can run different types of analytics directly, such as SQL queries, real-time analytics, and AI/Machine Learning. A metadata catalog of the data enables the analytics of the non-relational data. 
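As a simple illustration of that pattern, here is a hedged sketch of a Presto query over an S3-backed table registered in a metadata catalog; the catalog, schema, table, and column names are hypothetical.

-- Ad hoc SQL over hypothetical IoT event data sitting on S3
SELECT device_type,
       approx_distinct(device_id) AS unique_devices,
       count(*) AS events
FROM lake.telemetry.iot_events
WHERE event_date >= date '2022-01-01'
GROUP BY device_type
ORDER BY events DESC;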

Why Open for Data Lakes

As mentioned, companies have the flexibility to run different types of analytics, using different analytics engines and frameworks. Storing the data in open formats is the best practice for companies looking to avoid the lock-in of the traditional cloud data warehouse. The most common formats of a modern data infrastructure, such as Apache Parquet and ORC, are open; they are designed for fast analytics and are independent of any platform. Once data is in an open format like Parquet, the natural next step is to run open source engines like Presto on it. Ahana Cloud is a Presto managed service which makes it easy, secure, and cost efficient to run SQL on the Open Data Lake.

If you want to learn more about why you should be thinking about building an Open Data Lake in the cloud, check out our free whitepaper on Unlocking the Business Value of the Data Lake – how open and flexible cloud services help provide value from data lakes.

Helpful Links

Best Practices for Resource Management in PrestoDB

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

ahana logo

AWS Athena vs AWS Glue: What Are The Differences?

Amazon’s AWS platform has over 200 products and services, which can make understanding what each one does and how they relate confusing. Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

What is AWS Athena?

AWS Athena is a serverless implementation of Presto, offered as an interactive query service that allows you to query structured, semi-structured, or unstructured data straight out of S3 buckets.

What is AWS Glue?

AWS Glue is also serverless, but it is more of an ecosystem of tools that lets you easily do schema discovery and ETL with auto-generated scripts that can be modified either visually or by editing the script. The most commonly known components of Glue are the Glue Metastore and Glue ETL. The Glue Metastore is a serverless, Hive-compatible metastore which can be used in lieu of your own managed Hive. Glue ETL, on the other hand, is a Spark service which allows customers to run Spark jobs without worrying about the configuration, manageability, and operationalization of the underlying Spark infrastructure. There are other services, such as Glue Data Wrangler, which we will keep outside the scope of this discussion.

AWS Athena vs AWS Glue

Where this turns from AWS Glue vs AWS Athena into AWS Glue working with Athena is the Glue Catalog. The Glue catalog is used as a central Hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services – Glue ETL, Athena, EMR, Lake Formation, AI/ML, etc. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics while Glue is more of a transformation and data movement tool.

Some examples of how Glue and Athena can work together would be:

  • Creating tables for Glue to use in ETL jobs. The table must have a property added to it called a classification, which identifies the format of the data. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena is shown below.
  • Transforming data into a format that is better optimized for query performance in Athena, which also reduces cost. For example, converting a CSV or JSON file into Parquet.

CREATE EXTERNAL TABLE sampleTable (
  col1 INT,
  col2 INT,
  str1 STRING
)
STORED AS AVRO
TBLPROPERTIES (
  'classification'='avro'
);

Query S3 Using Athena & Glue

Now how about querying S3 data using both Athena and Glue? There are a few steps to set it up. First, we'll assume a simple CSV file with IoT data in it, such as the hypothetical sample below:
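The column names and values here are made up purely for illustration, chosen so that they line up with the query used later in this section.

device_id,reading_time,att1,att2
sensor-001,2022-06-01T10:00:00Z,17.4,Z
sensor-002,2022-06-01T10:00:05Z,21.9,Y
sensor-003,2022-06-01T10:00:10Z,19.2,Z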

We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’, the query looks like this:

SELECT * FROM my_table WHERE att2 = 'Z';

From here, you can perform any query you want; you can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. As a data analyst using Athena, you are insulated from the details of the backend, while the data engineers can optimize the source data for speed and cost using Glue.

AWS Athena is a great place to start if you are just getting started on the cloud and want to test the waters at low cost and with minimal effort. Athena, however, quickly runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs will also increase significantly as the scanned data volume grows.

At Ahana, many of our customers are former Athena users who ran into challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte scanned. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price performance, and dedicated compute resources.

You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/ 

Best Practices for Resource Management in PrestoDB

ahana logo

Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources. Resource management in a distributed environment makes data more accessible and manages resources over a network of autonomous computers (i.e., a distributed system), with resource sharing as its foundation.

PrestoDB is a distributed query engine created by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Built for the Hadoop ecosystem, PrestoDB is designed to scale to tens of thousands of nodes and process petabytes of data. To be usable at production scale, PrestoDB was built to serve thousands of queries from multiple users without bottlenecking or suffering from "noisy neighbor" issues. PrestoDB makes use of resource groups to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups, as well as best practices and considerations to think about before setting up a production system with resource groups.

Getting Started

Presto has multiple “resources” that it can manage resource quotas for. The two main resources are CPU and memory. Additionally, there are granular resource constraints that can be specified such as concurrency, time, and cpuTime. All of this is done via a pretty ugly JSON configuration file shown in the  example below from the PrestoDB doc pages.

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "schedulingPolicy": "weighted",
      "jmxExport": true,
      "subGroups": [
        {
          "name": "data_definition",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            },
            {
              "name": "bi-${tool_name}",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 10,
              "maxQueued": 100,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 3,
                  "maxQueued": 10
                }
              ]
            }
          ]
        },
        {
          "name": "pipeline",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 45,
          "maxQueued": 100,
          "schedulingWeight": 1,
          "jmxExport": true,
          "subGroups": [
            {
              "name": "pipeline_${USER}",
              "softMemoryLimit": "50%",
              "hardConcurrencyLimit": 5,
              "maxQueued": 100
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "bob",
      "group": "admin"
    },
    {
      "source": ".*pipeline.*",
      "queryType": "DATA_DEFINITION",
      "group": "global.data_definition"
    },
    {
      "source": ".*pipeline.*",
      "group": "global.pipeline.pipeline_${USER}"
    },
    {
      "source": "jdbc#(?<tool_name>.*)",
      "clientTags": ["hipri"],
      "group": "global.adhoc.bi-${tool_name}.${USER}"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ],
  "cpuQuotaPeriod": "1h"
}

Okay, so there is clearly a LOT going on here so let’s start with the basics and roll our way up. The first place to start is understanding the mechanisms Presto uses to enforce query resource limitation.

Penalties

Presto doesn't enforce any resources at execution time. Rather, Presto introduces the concept of a 'penalty' for users who exceed their resource specification. For example, if user 'bob' were to kick off a huge query that ended up taking vastly more CPU time than allotted, then 'bob' would incur a penalty that translates to an amount of time that 'bob's' queries would be forced to wait in a queued state until they could be runnable again. To see this scenario in action, let's split the cluster resources in half and see what happens when two users attempt to submit 5 different queries each at the same time.

Resource Group Specifications

The example below is a resource specification of how to evenly distribute CPU resources between two different users.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1h"
}

The above resource config defines two main resource groups called ‘query1’ and ‘query2’. These groups will serve as buckets for the different queries/users. A few parameters are at work here:

  • hardConcurrencyLimit sets the number of concurrent queries that can be run within the group
  • maxQueued sets the limit on how many queries can be queued
  • schedulingPolicy ‘fair’ determines how queries within the same group are prioritized

Kicking off a single query as each user runs immediately, but subsequent queries stay QUEUED until the first completes, which at least confirms the hardConcurrencyLimit setting. Queuing 6 queries also shows that maxQueued is working as intended; one way to observe this behavior is sketched below.
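While testing, the coordinator's runtime system tables can be used to watch queries sitting in the queue; this is a hedged sketch, and exact column availability can vary slightly by Presto version.

-- Show queries currently waiting in the queue and the resource group they matched
SELECT query_id, state, "user", resource_group_id
FROM system.runtime.queries
WHERE state = 'QUEUED';

The next configuration adds a soft CPU limit to each group: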

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

Introducing the soft CPU limit will penalize any query that is caught using too much CPU time within a given CPU quota period. Here the period is set to 1 minute and each group is given half of that CPU time. However, testing the above configuration yielded some odd results: once the first query finished, subsequent queries were queued for an inordinately long amount of time. Looking at the Presto source code shows the reason. The softCpuLimit and hardCpuLimit are based on a combination of total cores and the cpuQuotaPeriod. For example, on a 10 node cluster of r5.2xlarge instances, each Presto worker node has 8 vCPUs. That gives the cluster a total of 80 vCPUs, which translates to 80 minutes of vCPU time in each 1 minute cpuQuotaPeriod. Therefore, the correct values are shown below.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

With testing, the above resource group spec results in two queries completing, using a total of 127 minutes of CPU time. From there, all further queries block for about 2 minutes before they run again. This blocked time adds up because for every minute of cpuQuotaPeriod, each user is granted 40 minutes back against their penalty. Since the queries in the first minute exceeded the limit by 80+ minutes, it takes 2 cpuQuotaPeriods to bring the penalty back down to zero so queries can be submitted again.

Conclusion

Resource group implementation in Presto definitely has some room for improvement. The most obvious issue is that ad hoc users, who may not understand the cost of a query before executing it, can be heavily penalized and left able to run only very low cost queries for a while. However, this approach minimizes the damage that a single user can do to a cluster over an extended duration, and it averages out in the long run. Overall, resource groups are better suited for scheduled workloads that depend on variable input data, so that a scheduled job doesn't arbitrarily end up taking over a large chunk of resources. For resource partitioning between multiple users or teams, the best approach still seems to be running and maintaining multiple segregated Presto clusters.


Ready to get started with Presto? Check out our tutorial series where we cover the basics: Presto 101: Installing & Configuring Presto locally.

Query editor

Querying Parquet Files using AWS Amazon Athena

Parquet is a newer file format with many advantages over more commonly used formats like CSV and JSON. Specifically, its speed and efficiency at storing large volumes of data in a columnar layout have made it widely used. It supports many optimizations and stores metadata about its internal contents to support fast lookups and searches by modern distributed query/compute engines like PrestoDB, Spark, Drill, etc. Here are the steps to quickly get set up to query your Parquet files with a service like Amazon Athena.

Prerequisites

  • Sample Parquet Data –  https://ahana-labs.s3.amazonaws.com/movielens/ratings/ratings.parquet
  • AWS Account and Role with access to below services:
    • AWS S3
    • AWS Glue (Optional but highly recommended)
    • AWS Athena

Setting up the Storage

For this example we will be querying the parquet files from AWS S3. To do this, we must first upload the sample data to an S3 bucket. 

Log in to your AWS account and select the S3 service in the Amazon Console.

  1. Click on Create Bucket
  2. Choose a name that is unique. For this example I chose ‘athena-parquet-<your-initials>’. S3 is a global service so try to include a unique identifier so that you don’t choose a bucket that has already been created. 
  3. Scroll to the bottom and click Create Bucket
  4. Click on your newly created bucket
  5. Create a folder in the S3 bucket called ‘test-data’
  6. Click on the newly created folder
  7. Choose Upload Data and upload your parquet file(s).

Running a Glue Crawler

Now that the data is in S3, we need to define the metadata for the file. This can be tedious and involve using a different reader program to read the parquet file to understand the various column field names and types. Thankfully, AWS Glue provides a service that can scan the file and fill in the requisite metadata auto-magically. To do this, first navigate to the AWS Glue service in the AWS Console.

  1. On the AWS Glue main page, select ‘Crawlers’ from the left hand side column
  2. Click Add Crawler
  3. Pick a name for the crawler. For this demo I chose to use ‘athena-parquet-crawler’. Then choose Next.
  4. In Crawler Source Type, leave the settings as is (‘Data Stores’ and ‘Crawl all folders’) and choose Next.
  5. In Data Store under Include Path, type in the URL of your S3 bucket. It should be something like ‘s3://athena-parquet-<your-initials>/test-data/’.
  6. In IAM Role, choose Create an IAM Role and fill the suffix with something like ‘athena-parquet’. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket.
  7. For Frequency leave the setting as default and choose Next
  8. For Output, choose Add Database and create a database with the name ‘athena-parquet’. Then choose Next.
  9. Review and then choose Finish.
  10. AWS will prompt you if you would like to run the crawler. Choose Run it now or manually run the crawler by refreshing the page and selecting the crawler and choosing the action Run.
  11. Wait for the crawler to finish running. You should see the number 1 in the column Tables Added for the crawler.

Querying the Parquet file from AWS Athena

Now that the data and the metadata are created, we can use AWS Athena to query the parquet file. Choose the Athena service in the AWS Console.

  1. Choose Explore the Query Editor and it will take you to the query editor UI.
  2. Before you can proceed, Athena will require you to set up a Query Results Location. Select the prompt and set the Query Result Location to 's3://athena-parquet-<your-initials>/test-results/'.
  3. Go back to the Editor and run the following statement: 'SELECT * FROM test_data LIMIT 10;'. The table name will be based on the folder name you chose in the S3 storage step.
  4. The result should show the first 10 rows of your Parquet data.

Conclusion

Some of these steps, like using Glue Crawlers, aren’t required but are a better approach for handling Parquet files where the schema definition is unknown. Athena itself is a pretty handy service for getting hands on with the files themselves but it does come with some limitations. 

Those limitations include concurrency limits, price performance impact, and no control of your deployment. Many companies are moving to a managed service approach, which takes care of those issues. Learn more about AWS Athena limitations and why you should consider a managed service like Ahana for your SQL on S3 needs.

Configuring RaptorX – a multi-level caching with Presto

Multi-level Data Lake Caching with RaptorX

RaptorX Background and Context

Meta introduced a multi-level cache at PrestoCon 2021. Code-named the "RaptorX Project," it aims to make Presto 10x faster on Meta-scale petabyte workloads. Here at Ahana, engineers have also been working on RaptorX to make it usable for the community by fixing a few open issues and by tuning and testing heavily with other workloads. This is a unique and very powerful feature only available in PrestoDB and not in any other versions or forks of the Presto project.

Presto is a disaggregated compute-storage query engine, which helps customers and cloud providers scale independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, since scanning huge amounts of data between the storage tier and the compute tier is IO-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to not perform any I/O at all, which means that memory utilization and caching structures are of utmost importance.

Let's walk through the normal workflow of how the Presto Hive connector works:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition info)
  2. The scheduler sends requests to remote storage to get a list of files and does the scheduling
  3. A worker node receives the list of files from the scheduler and sends a request to remote storage to open a file and read the file footers
  4. Based on the footers, Presto understands which data blocks or chunks need to be read from remote storage
  5. Once the workers have read them, Presto performs the join or aggregation computation on the leaf worker nodes and shuffles the results back to send to the client

That is a lot of RPC calls, not just to the Hive Metastore to get the partition information but also to the remote storage to list files, schedule those files, open files, and then retrieve and read the data files. Each of these IO paths for the Hive connector is a bottleneck on query performance, and this is the reason RaptorX builds a multi-layer cache: to maximize the cache hit rate and boost query performance.

RaptorX introduces a total of five types of caches plus a scheduler. This cache system is only applicable to Hive connectors.

Multi-layer Cache | Type | Affinity Scheduling | Benefits
Data IO | Local Disk | Required | Reduced query latency
Intermediate Result Set | Local Disk | Required | Reduced query latency and CPU utilization for aggregation queries
File Metadata | In-memory | Required | Reduced CPU utilization and query latency
Metastore | In-memory | NA | Reduced query latency
File List | In-memory | NA | Reduced query latency
Table: Summary of Presto Multi-Layer Cache Implementation

The rest of this article explains how you can configure and test the various layers of the RaptorX cache in your Presto cluster.

#1 Data(IO) cache

This cache makes use of a library built on the Alluxio LocalCacheFileSystem, which is an implementation of the HDFS interface. The Alluxio data cache is a cache on the worker node's local disk that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB, evictions follow an LRU policy, and local disks are required in order to enable this cache.

To enable this cache, the worker configuration needs to be updated with the below properties in etc/catalog/<catalog-name>.properties. The maximum cache size can be adjusted based on your requirements.

cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache

Also add the below Alluxio property to the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:
-Dalluxio.user.app.id=presto

#2 Fragment result set cache

This is an intermediate result set cache that lets you cache partially computed result sets on the worker's local SSD drive. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.

Add the following properties in /config.properties:

fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h

#3 Metastore cache

A Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. This cache is versioned to confirm the validity of the cached metadata.

To enable the metastore cache, set the below properties in /<catalog-name>.properties:

hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

#4 File List cache

A Presto coordinator caches file lists from the remote storage partition directory to avoid long listFile calls to remote storage. This is a coordinator-only, in-memory cache.

Enable the file list cache by setting the below properties in

/catalog/<catalog-name>.properties 

# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*

#5 File metadata cache

This cache stores open file descriptors and stripe/file footer information in worker memory. These pieces of data are the most frequently accessed when reading files. This cache is useful not just for decreasing query latency but also for reducing CPU utilization.

This is an in-memory cache, suitable for the ORC and Parquet file formats.

For ORC, it includes the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches file- and block-level metadata.

To enable the file metadata cache, set the below properties in /<catalog-name>.properties:

# For ORC metadata cache:
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h

The <catalog-name> in the above configuration should be replaced by the catalog name that you are setting these properties in. For example, if the catalog properties file is named ahana_hive.properties, then it should be replaced with "ahana_hive".

#6 Affinity scheduler

With affinity scheduling, the Presto coordinator schedules requests that process certain data/files to the same Presto worker node in order to maximize cache hits. Sending requests for the same data consistently to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling. Hence, this is a must-have property that needs to be enabled in order for the RaptorX Data IO, fragment result, and file metadata caches to work.

To enable the affinity scheduler, set the below property in /catalog.properties

hive.node-selection-strategy=SOFT_AFFINITY

How can you test or debug your RaptorX cache setup with JMX metrics?

Each section below lists queries to run to exercise a cache, followed by JMX queries to verify the cache usage.

Note: If your catalog is not named ‘ahana_hive’, you will need to change the table names to verify the cache usage. Substitute ahana_hive with your catalog name.

Data IO Cache

Queries to trigger Data IO cache usage

USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;

Queries to verify Data IO data cache usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages(of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';

Fragment Result Cache

An example of the query plan fragment that is eligible for having its results cached is shown below.

Fragment 1 [SOURCE] 
Output layout: [count_3] Output partitioning: SINGLE [] Stage Execution 
Strategy: UNGROUPED_EXECUTION 
- Aggregate(PARTIAL) => [count_3:bigint] count_3 := "presto.default.count"(*) 
- TableScan[TableHandle {connectorId='hive', 
connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, 
analyzePartitionValues=Optional.empty}', 
layout='Optional[default.customer_orc{}]'}, gr Estimates: {rows: 150000 (0B), 
cpu: 0.00, memory: 0.00, network: 0.00} LAYOUT: default.customer_orc{}

Queries to trigger fragment result cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query the fragment result set cache JMX metrics.

-- All Fragment result set cache metrics like cachehit, cache entries, size, etc 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";

ORC metadata cache

Queries to trigger ORC cache usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query ORC Metadata cache JMX metrics

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";

Parquet metadata cache

Queries to trigger Parquet metadata cache

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;

Query Parquet Metadata cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";

File List cache

Query File List cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";

In addition to this, we have enabled these multi-layer caches on Presto for Ahana Cloud by adding S3 support as the external filesystem for the Data IO cache, along with more optimized scheduling and tooling to visualize the cache usage.

Figure: Multi-level Data Lake Caching with RaptorX

Ahana-managed Presto clusters can take advantage of the RaptorX cache, and at Ahana we have simplified all of these steps so that data platform users can enable Data Lake caching seamlessly with just one click. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It's SaaS for Presto and takes away all the complexities of tuning, management, and more. Check out our on-demand webinar where we share how you can build an Open Data Lake Analytics stack.

icon-aws-lake-formation.png

The Role of Blueprints in Lake Formation on AWS

Why does this matter?

There are two major steps to creating a Data Lakehouse on AWS: first, set up your S3-based Data Lake, and second, run analytical queries on it. A popular SQL engine that you can use is Presto. This article is focused on the first step and on how AWS Lake Formation Blueprints can make it easy and automated. Before you can run analytics to get insights, you need your data continuously pooling into your lake!

AWS Lake Formation helps with the time-consuming data wrangling involved in maintaining a data lake, making it simple and secure. Lake Formation includes a Workflows feature; a workflow encapsulates a complex set of ETL jobs that load and update data.

Figure: Lake Formation workflow diagram

What is a Blueprint?

A Lake Formation Blueprint allows you to easily stamp out and create workflows. This is an automation capability within Lake Formation. There are 3 types: Database snapshots, incremental database, and log file blueprints.

The database blueprints support automated data ingestion from sources such as MySQL, PostgreSQL, and SQL Server into the open data lake. It’s a point-and-click experience with simple forms in the AWS console.

A database snapshot does what it sounds like: it loads all the tables from a JDBC source into your lake. This is useful when you want time-stamped end-of-period snapshots to compare later.

An incremental database blueprint also does what it sounds like: it takes only the new data, or deltas, into the data lake. This is faster and keeps the latest data in your lake. The incremental database blueprint uses bookmark columns to track progress across successive runs.

The log file blueprint takes logs from various sources and loads them into the data lake. ELB logs, ALB logs, and CloudTrail logs are examples of popular log files that can be loaded in bulk.

Summary and how about Ahana Cloud?

With AWS Lake Formation, getting data into your data lake is easy, automated, and consistent. Once your data is ingested, you can use a managed service like Ahana Cloud for Presto to enable fast queries on your data lake and derive important insights for your users. Ahana Cloud has integrations with AWS Lake Formation governance and security policies. See that page here: https://ahana.io/aws-lake-formation


What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

While the thrust of this article is an AWS Redshift Spectrum vs Athena comparison, there can be some confusion with the difference between AWS Redshift Spectrum and AWS Redshift. Very briefly, Redshift is the storage layer/data warehouse, and Redshift Spectrum is an extension to Redshift that is a query engine.

Amazon Athena

Athena is Amazon’s standalone, serverless SQL query service, built on Presto, that is used to query data stored on Amazon S3. It is fully managed by Amazon; there is nothing to set up, manage, or configure. This also means that performance can be very inconsistent, as you have no dedicated compute resources.

Amazon Redshift Spectrum

Redshift Spectrum is an extension of Amazon Redshift. It is a serverless query engine that can query both AWS S3 data and tabular data in Redshift using SQL. This enables you to join data stored in external object stores with data stored in Redshift to perform more advanced queries.

Key Features & Differences: Redshift Spectrum vs Athena

Athena and Redshift Spectrum offer similar functionality, namely, serverless query of S3 data using SQL. That makes them easy to manage and cost-effective as there is nothing to set up and you are only charged based on the amount of data scanned. S3 storage is significantly less expensive than a database on AWS for the same amount of data.

  • Both are serverless, however Spectrum resources are allocated based on your Redshift cluster size, while Athena relies on non-dedicated, pooled resources.
  • Spectrum actually does need a bit of cluster management, but Athena is truly serverless.
  • Performance for Athena depends on your S3 optimization, while Spectrum, as previously noted, depends on your Redshift cluster resources and S3 optimization. If you need a specific query to run more quickly, then you can allocate additional compute resources to it.
  • Redshift Spectrum runs in tandem with Amazon Redshift, while Athena is a standalone query engine for querying data stored in Amazon S3.
  • Spectrum provides more consistency in query performance while Athena has inconsistent results due to the pooled resources.
  • Athena is great for simpler interactive queries, while Spectrum is more oriented towards large, complex queries.
  • The cost for both is the same at $5 per compressed terabyte scanned, however with Spectrum, you must also consider the Redshift compute costs.
  • Both use AWS Glue for schema management, and while Athena is designed to work directly with Glue, Spectrum needs external tables to be configured for each Glue catalog schema.
  • Both support federated queries.

Functionality

The functionality of each is very similar, namely using standard SQL to query the S3 object store. If you are working with Redshift, then Spectrum can join data in S3 with tables stored in Redshift directly. Athena also has a Redshift connector to allow for similar joins; however, if you are already using Redshift, it would likely make more sense to use Spectrum in this case.
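To make that concrete, here is a minimal sketch of the Spectrum pattern: register a Glue database as an external schema in Redshift, then join the S3-backed external table with a native Redshift table. The schema, database, table, and IAM role names below are hypothetical.

-- Expose an existing Glue database ('sales') to Redshift Spectrum as an external schema.
-- The database name and IAM role ARN are placeholders.
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE 'sales'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';

-- Join S3-backed external data with a table stored in Redshift itself.
SELECT c.region, SUM(t.amount) AS total_amount
FROM spectrum_sales.transactions t        -- external table over S3 data
JOIN public.customers c                   -- native Redshift table
  ON c.customer_id = t.customer_id
GROUP BY c.region;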

Integrations

Keep in mind that when working with S3 objects, you are not working with traditional databases, which means there are no indexes to scan or use for joins. If you are trying to join high-cardinality files, you will likely see very poor performance.

When connecting to data sources other than S3, Athena has a connector ecosystem to work with, which provides a collection of sources that you can directly query with no copy required. Federated queries were added to Spectrum in 2020 and provide a similar capability with the added benefit of being able to perform transformations on the data and load it directly into Redshift tables.

AWS Athena vs Redshift: To Summarize

If you are already using Redshift, then Spectrum makes a lot of sense, but if you are just getting started with the cloud, the Redshift ecosystem is likely overkill. AWS Athena is a good place to start if you want to test the waters at low cost and with minimal effort. Athena, however, quickly runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs will also increase significantly as the scanned data volume grows.

At Ahana, many of our customers are former Athena and/or Redshift users who ran into challenges around price-performance (Redshift) and concurrency/deployment control (Athena). Keep in mind that Athena and Redshift Spectrum both charge $5 per terabyte scanned, while Ahana is priced purely on instance hours, giving you the power of Presto, ease of setup and management, strong price-performance, and dedicated compute resources.
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/


Ahana Announces New Security Capabilities to Bring Next Level of Security to the Data Lake

Advancements include multi-user support, deep integration with Apache Ranger, and audit support 

San Mateo, Calif. – February 23, 2022 Ahana, the only SaaS for Presto, today announced significant new security features added to its Ahana Cloud for Presto managed service. They include multi-user support for Presto and Ahana, fine-grained access control for data lakes with deep Apache Ranger integration, and audit support for all access. These are in addition to the recently announced one-click integration with AWS Lake Formation, a service that makes it easy to set up a secure data lake in a matter of hours.

The data lake isn’t just the data storage it used to be. More companies are using the data lake to store business-critical data and running critical workloads on top of it, making security on that data lake even more important. With these latest security capabilities, Ahana is bringing an even more robust offering to the Open Data Lake Analytics stack with Presto at its core.

“From day one we’ve focused on building the next generation of open data lake analytics. To address the needs of today’s enterprises that leverage the data lake, we’re bringing even more advanced security features to Ahana Cloud,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “The challenge with data lake security is in its shared infrastructure, and as more data is shared across an organization and different workloads are run on the same data, companies need fine-grained security policies to ensure that data is accessed by the right people. With these new security features, Ahana Cloud will enable faster adoption of advanced analytics with data lakes with advanced security built in.”

“Over the past year, we’ve been thrilled with what we’ve been able to deliver to our customers. Powered by Ahana, our data platform enables us to remain lean, bringing data to consumers when they need it,” said Omar Alfarghaly, Head of Data Science, Cartona. “With advanced security and governance, we can ensure that the right people access the right data.”

New security features include:

  • Multi-user support for Presto: Data platform admins can now seamlessly manage users without complex authentication files and add or remove users for their Presto clusters. Unified user management is also extended across the Ahana platform and can be used across multiple Presto clusters. For example, a data analyst gets access to the analytics cluster but not to the data science cluster.
  • Multi-user support for Ahana: Multiple users are now supported in the Ahana platform. An admin can invite additional users via the Ahana console. This is important for growing data platform teams.
  • Apache Ranger support: Our open source plugin allows users to enable authorization in Ahana-managed Presto clusters with Apache Ranger for both Hive Metastore and Glue Catalog queries, including fine-grained access control down to the column level across all clusters. In this newest release of the Ahana and Apache Ranger plug-in, all of the open source Presto and Apache Ranger work is now available in Ahana, and it’s now incredibly easy to integrate with just the click of a button. With the Apache Ranger plugin, customers can easily add role-based authorization. Policies from Apache Ranger are also cached in the plugin so there is little to no query-time latency impact. Previously, support for Apache Ranger was only available in open source using complicated config files.
  • Audit support: With extended Apache Ranger capabilities, Ahana customers can enable centralized auditing of user access on Ahana-managed Presto clusters for comprehensive visibility. For example, you can track when users request access to data and if those requests are approved or denied based on their permission levels.
  • AWS Lake Formation integration: Enforce AWS Lake Formation fine-grained data lake access controls with Ahana-managed Presto clusters.

“We’re seeing an increasing proportion of organizations using the cloud as their primary data lake platform to bring all of an enterprise’s raw structured and unstructured data together, realizing significant benefits such as creating a competitive advantage and helping lower operational costs,” said Matt Aslett, VP and Research Director, Ventana Research. “Capabilities such as governance mechanisms that allow for fine-grained access control remain important given the simplicity of the cloud. Innovations that allow for better data governance on the data lake, such as those Ahana has announced today, will help propel usage of more sophisticated use cases.”

Supporting Resources:

Tweet this:  @AhanaIO announces new security capabilities for the data lake #analytics #security #Presto https://bit.ly/3H0Hr7p

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

AWS Lake Formation vs AWS Glue – What are the differences?

As you start building your analytics stack in AWS, there are several AWS technologies to understand. In this article we’ll discuss two key ones: AWS Lake Formation, for security and governance, and AWS Glue, a data integration service with a central data catalog. For reference, AWS Lake Formation is built on AWS Glue, and both services share the same AWS Glue Data Catalog.

AWS Lake Formation 

AWS Lake Formation makes it easier for you to build, secure, and manage data lakes.

AWS Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytics and ML services.

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

Lake Formation creates Glue workflows that crawl source tables, extract the data, and load it into the Amazon S3 data lake.

When to use AWS Lake Formation? 

  • Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  • Add Authorization on your Data Lake  – You can centrally define and enforce security, governance, and auditing policies.
  • Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and join data for analytics, machine learning, and application development. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog which discovers and catalogs metadata about your data stores or data lake.  Using the AWS Glue Data Catalog, users can easily find and access data.
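As a quick illustration of how the Data Catalog is consumed, once Glue is configured as the metastore behind a query engine such as Presto, its databases and tables appear as ordinary schemas and tables in SQL. In the sketch below, the catalog name glue and the sales database are assumptions, not defaults.

-- Browse the Glue Data Catalog through a SQL engine (catalog, schema, and table names are illustrative)
SHOW SCHEMAS FROM glue;
SHOW TABLES FROM glue.sales;
SELECT * FROM glue.sales.transactions LIMIT 10;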

When to use AWS Glue?

  • Create a unified data catalog to find data across multiple data stores – View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
  • Data Catalog for data lake analytics with S3 – Organize, cleanse, validate, and format data for storage in a data warehouse or data lake
  • Build ETL pipelines to ingest data into your S3 data lake. 

The data workflows initiated from an AWS Lake Formation blueprint are, under the hood, AWS Glue workflows. You can view and manage these workflows in either the Lake Formation console or the AWS Glue console.

AWS Lake Formation vs AWS Glue: A Summary

AWS Lake Formation simplifies security and governance on the data lake, whereas AWS Glue simplifies metadata management and data discovery for data lake analytics.

Check out our community roundtable where we discuss how you can build a simple data lake with the new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack.


Amazon S3 Select Limitations

What is Amazon S3 Select?

Amazon S3 Select allows you to use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. 

Why use Amazon S3 Select?

Instead of pulling the entire dataset and then manually extracting the data that you need,  you can use S3 Select to filter this data at the source (i.e. S3). This reduces the amount of data that Amazon S3 transfers, which reduces the cost, latency, and data processing time at the client.

What formats are supported for S3 Select?

Currently Amazon S3 Select only works on objects stored in CSV, JSON, or Apache Parquet format. The stored objects can be compressed with GZIP or BZIP2 (for CSV and JSON objects only). The returned filtered results can be in CSV or JSON, and you can determine how the records in the result are delimited.

How can I use Amazon S3 Select standalone?

You can perform S3 Select SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. 
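Whichever interface you use, the SQL expression itself always addresses the single object being queried as S3Object. Here is a minimal sketch, assuming a CSV object with a header row and FileHeaderInfo set to USE; the column names are illustrative.

-- Return two columns and only the rows matching the predicate, filtered inside S3
SELECT s.category, s.amount
FROM S3Object s
WHERE CAST(s.amount AS FLOAT) > 100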

What are the limitations of S3 Select?

Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

Additionally, the following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. 

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy.
  • Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 256 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

What is the difference between S3 Select and Presto?

S3 Select is a minimalistic form of pushdown to the source with limited support for the ANSI SQL dialect. Presto, on the other hand, is a comprehensive, ANSI SQL-compliant query engine that can work with a variety of data sources. Here is a quick comparison table.

Comparison               S3 Select                        Presto
SQL Dialect              Fairly limited                   Comprehensive
Data Format Support      CSV, JSON, Parquet               Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet
Data Sources             S3 only                          Various (over 26 open-source connectors)
Push-Down Capabilities   Limited to supported formats     Varies by format and underlying connector

What is the difference between S3 Select and Athena?

Athena is Amazon’s fully managed service based on Presto. As such, the comparison between Athena and S3 Select is essentially the same as outlined above. For a more detailed understanding of the difference between Athena and Presto, see here.

How does S3 Select work with Presto?

S3SelectPushdown can be enabled on your Hive catalog as a configuration option that pushes projection (SELECT) and predicate (WHERE) processing down to S3 Select. With S3SelectPushdown, Presto retrieves only the required data from S3 instead of entire S3 objects, reducing both latency and network usage.

Should I turn on S3 Select for my workload on Presto? 

S3SelectPushdown is disabled by default and you should enable it in production after proper benchmarking and cost analysis. The performance of S3SelectPushdown depends on the amount of data filtered by the query. Filtering a large number of rows should result in better performance. If the query doesn’t filter any data then pushdown may not add any additional value and the user will be charged for S3 Select requests.

We recommend that you benchmark your workloads with and without S3 Select to see if using it may be suitable for your workload. For more information on S3 Select request cost, please see Amazon S3 Cloud Storage Pricing.

Use the following guidelines to determine if S3 Select is a good fit for your workload:

  • Your query filters out more than half of the original data set.
  • Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The TIMESTAMP, REAL, and DOUBLE data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see the Data Types documentation.
  • Your network connection between Amazon S3 and the Presto cluster has good transfer speed and available bandwidth (For the best performance on AWS, your cluster is ideally colocated in the same region and the VPC is configured to use the S3 Gateway endpoint).
  • Amazon S3 Select does not compress HTTP responses, so the response size may increase for compressed input files.

Additional Considerations and Limitations:

  • Only objects stored in CSV format are supported (Parquet is not supported in Presto via the S3 Select configuration). Objects can be uncompressed or optionally compressed with gzip or bzip2.
  • The “AllowQuotedRecordDelimiters” property is not supported. If this property is specified, the query fails.
  • Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption is not supported.
  • S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC and Parquet.

S3 Select makes sense for my workload on Presto, how do I turn it on?

You can enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property or the hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, allowing you to enable or disable it on a per-query basis. You may also need to tune connection properties such as hive.s3select-pushdown.max-connections depending on your workload.
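For illustration, here is what enabling it per session might look like from a Presto client, assuming the Hive catalog is named hive and a CSV-backed table exists; the table and column names are hypothetical.

-- Enable S3 Select pushdown for this session only, using the session property named above
SET SESSION hive.s3_select_pushdown_enabled = true;

-- Eligible projection (SELECT) and predicate (WHERE) processing is now pushed to S3 Select
SELECT category, amount
FROM hive.default.transactions_csv
WHERE amount > 100;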


Querying Amazon S3 Data Using AWS Athena

The data lake is becoming increasingly popular for more than just data storage. Now we see much more flexibility with what you can do with the data lake itself – add a query engine on top to get ad hoc analytics, reporting and dashboarding, machine learning, etc. 

How Athena works with Amazon S3

In AWS land, Amazon S3 is the de facto data lake. Many AWS users who want to start easily querying that data will use Amazon Athena, a serverless query service that allows you to run ad hoc analytics using SQL on your data. Amazon Athena is built on Presto, the open source SQL query engine that came out of Meta (Facebook) and is now an open source project housed under the Linux Foundation. One of the most popular use cases is to query S3 with Athena.

The good news about Amazon Athena is that it’s really easy to get up and running. You can simply add the service and start running queries on your S3 data lake right away. Because Athena is based on Presto, you can query data in many different formats including JSON, Apache Parquet, Apache ORC, CSV, and a few more. Many companies today use Athena to query S3.

How to query S3 using Athena

The first thing you’ll need to do is create a new bucket in AWS S3 (or you can use an existing one, though for the purposes of testing it out, creating a new bucket is probably helpful). You’ll use Athena to query S3 buckets. Next, open your AWS Management Console and go to the Athena home page. From there you have a few options for creating a table; for this example, just select the “Create table from S3 bucket data” option.

From there, AWS has made it fairly easy to get up and running with a quick four-step process where you’ll define the database, table name, and S3 folder the data for this table will come from. You’ll select the data format, define your columns, and then set up your partitions (if you have a lot of data). Briefly laid out:

  1. Set up your Database, Table, and Folder Names & Locations
  2. Choose the data format you’ll be querying
  3. Define your columns so Athena understands your data schema
  4. Set up your Data Partitions if needed

Now you’re ready to start querying with Athena. You can run simple select statements on your data, giving you the ability to run SQL on your data lake.
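If you prefer DDL over the console wizard, the same setup can be expressed directly in the Athena query editor. The sketch below is a hypothetical example for CSV data; the bucket, table, and column names are illustrative.

-- Hypothetical Athena table over CSV files in S3
CREATE EXTERNAL TABLE sales_transactions (
  txn_time    string,
  customer_id bigint,
  category    string,
  amount      double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/sales/'
TBLPROPERTIES ('skip.header.line.count'='1');

-- A simple ad hoc query against the new table
SELECT category, sum(amount) AS total
FROM sales_transactions
GROUP BY category
ORDER BY total DESC;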

When Athena hits its limits

While Athena is very easy to get up and running, it has known limitations that start impacting price-performance as usage grows. These include query limits, partition limits, a lack of deterministic performance, and more. It’s actually why we see a lot of previous Athena users move to Ahana Cloud for Presto, our managed service for Presto on AWS.

Here’s a quick comparison between the two offerings:

Figure: Comparing Amazon Athena and Ahana Cloud

Some of our customers shared why they moved from AWS Athena to Ahana Cloud. Adroitts saw 5.5X price performance improvement, faster queries, and more control after they made the switch, while SIEM leader Securonix saw 3X price performance improvement along with better performing queries.

We can help you benchmark Athena against Ahana Cloud; get in touch with us today and let’s set up a call.


What is AWS Lake Formation?

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

We’re seeing more companies move to the data lake because it’s flexible, cheaper, and much easier to use than a data warehouse. You’re not locked into proprietary formats, nor do you have to ingest all of your data into a proprietary technology. As more companies are leveraging the data lake, then security becomes even more important because you have more people needing access to that data and you want to be able to control who sees what. 

AWS Lake Formation can help address security on the data lake. For Amazon S3 users, it’s a seamless integration that allows you to get granular security policies in place on your data. AWS Lake Formation gives you three key capabilities:

  1. Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  2. Simplify security management – You can centrally define and enforce security, governance, and auditing policies.
  3. Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

If you’re currently using AWS S3 or planning to, we recommend looking at AWS Lake Formation as an easy way to get security policies in place on your data lake. As part of your stack, you’ll also need a query engine that will allow you to get analytics on your data lake. The most popular engine to do that is Presto, an open source SQL query engine built for the data lake.

At Ahana, we’ve made it easy to get started with this stack: AWS S3 + Presto + AWS Lake Formation. We provide SaaS for Presto with out of the box integrations with S3 and Lake Formation, so you can get a full data lake analytics stack up and running in a matter of hours.

Figure: AWS Lake Formation diagram

Check out our webinar where we share more about our integration with AWS Lake Formation and how you can actually enforce security policies across your organization.


How does Presto Work With LDAP?

What is LDAP?

The Lightweight Directory Access Protocol (LDAP) is an open, vendor-neutral, industry-standard application protocol for accessing directory services, commonly used for authentication. With LDAP user authentication, the Presto coordinator validates a client’s credentials against an external LDAP server before the client is allowed to communicate with the Presto server.

Presto & LDAP

Presto can be configured to enable LDAP authentication over HTTPS for clients, such as the Presto CLI, or the JDBC and ODBC drivers. At present only a simple LDAP authentication mechanism involving username and password is supported. The Presto client sends a username and password to the coordinator and the coordinator validates these credentials using an external LDAP service.

To enable LDAP authentication for Presto, the Presto coordinator configuration file needs to be updated with LDAP-related configurations. No changes are required to the worker configuration; only the communication from the clients to the coordinator is authenticated. However, if you want to secure the communication between Presto nodes then you should configure Secure Internal Communication with SSL/TLS.

Summary of Steps to Configure LDAP Authentication with Presto:

Step 1: Gather configuration details about your LDAP server

Presto requires Secure LDAP (LDAPS), so make sure you have TLS enabled on your LDAP server as well.

Step 2: Configure SSL/TLS on the Presto Coordinator

Access to the Presto coordinator must be through HTTPS when using LDAP authentication.

Step 3: Configure Presto Coordinator with config.properties for LDAP

Step 4: Create a Password Authenticator Configuration (etc/password-authenticator.properties) file on the coordinator

Step 5: Configure Client / Presto CLI with either a Java Keystore file or Java Truststore for its TLS configuration.

Step 6: Restart your Presto cluster and invoke the LDAP-enabled CLI with the --keystore-* properties, the --truststore-* properties, or both to secure the TLS connection.

Reference: https://prestodb.io/docs/current/security/ldap.html

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our presentation with AWS on how to get started in 30min with Presto in the cloud.


What is Apache Ranger?

Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the data platform. It is an open-source authorization solution that provides access control and audit capabilities for big data platforms through centralized security administration.

Its open data governance model and plugin architecture have enabled access control to be extended to projects beyond the Hadoop ecosystem, and the platform is widely supported by major cloud vendors such as AWS, Azure, and GCP.

With the help of the Apache Ranger console, admins can easily manage centralized, fine-grained access control policies, including file-, folder-, database-, table-, and column-level policies across all clusters. These policies can be defined at the user, role, or group level.

Apache Service Integration

Apache Ranger uses a plugin architecture to allow other services to integrate seamlessly with its authorization controls.


Figure: Simple sequence diagram showing how the Apache Ranger plugin enforces authorization policies with Presto Server.

Apache Ranger also supports centralized auditing of user access and administrative actions for comprehensive visibility into sensitive data usage. Its centralized audit store tracks all access requests in real time and supports multiple backends, including Elasticsearch and Solr.

Many companies are today looking to leverage the Open Data Lake Analytics stack, which is the open and flexible alternative to the data warehouse. In this stack, you have flexibility when it comes to your storage, compute, and security to get SQL on your data lake. With Ahana Cloud, the stack includes AWS S3, Presto, and in this case our Apache Ranger integration. 
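To give a feel for what fine-grained access control looks like from the query side, here is a hedged sketch: assume a user mapped to an analyst role whose Ranger policy grants SELECT on the sales database but not on the hr database. The catalog, database, and table names are hypothetical.

-- Allowed: a Ranger policy grants this role SELECT on sales.customers
SELECT count(*) FROM ahana_hive.sales.customers;

-- Denied: no policy covers hr.employees for this role, so Presto rejects the query with an Access Denied error
SELECT count(*) FROM ahana_hive.hr.employees;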

Ahana Cloud for Presto and Apache Ranger

Ahana-managed Presto clusters can take advantage of the Apache Ranger integration to enforce access control policies defined in Apache Ranger. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our on-demand webinar, hosted with DZone, where we share how you can build an Open Data Lake Analytics stack.

Benchmarking Warehouse Workloads on the Data Lake using Presto

TPC-H Benchmark Whitepaper

How to run a TPC-H Benchmark on Presto

Presto is an open source MPP query engine designed from the ground up for high performance and linear scaling. Businesses looking to run their analytics workloads on Presto need to understand how to evaluate Presto performance, and this document will help with that benchmarking effort.

To help users who would like to benchmark Presto, we’ve written a technical guide on how to set up your Presto benchmark using Benchto, an open source framework that provides an easy and manageable way to define, run, and analyze macro benchmarks in a clustered environment.

Running a benchmark on Presto can help you to identify things like: 

  • system resource requirements 
  • resource usage during various operations 
  • performance metrics for such operations
  • …and more, depending on your workload and use case

This technical guide provides an overview of TPC-H, the industry-standard benchmark, and explains how to configure and use the open source Benchto tool to benchmark Presto. It also shows an example of comparing results between two runs of an Ahana-managed Presto cluster, with and without caching enabled.
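If you want a quick sanity check before standing up a full Benchto run, Presto’s bundled TPC-H connector can generate benchmark data on the fly. A minimal sketch, assuming the connector is mounted under the catalog name tpch:

-- A simplified take on TPC-H Q1 against generated data at scale factor 1
SELECT returnflag, linestatus,
       sum(quantity)      AS sum_qty,
       avg(extendedprice) AS avg_price
FROM tpch.sf1.lineitem
WHERE shipdate <= DATE '1998-09-02'
GROUP BY returnflag, linestatus
ORDER BY returnflag, linestatus;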

We hope you find this useful! Happy benchmarking.

AWS & Ahana Lake Formation

Webinar On-Demand
How to Build and Query Secure S3 Data Lakes with Ahana Cloud and AWS Lake Formation

AWS Lake Formation is a service that allows data platform users to set up a secure data lake in days. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and what data access and security policies you want to apply.

In this webinar, we’ll share more on the recently announced AWS Lake Formation and Ahana integration. The AWS & Ahana product teams will cover:

  • Quick overview of AWS Lake Formation & Ahana Cloud
  • The details of the integration
  • How data platform teams can seamlessly integrate Presto natively with AWS Glue, AWS Lake Formation and AWS S3 through a demo

Join AWS Solution Architect Gary Stafford and Ahana Principal Product Manager Wen Phan for this webinar where you’ll learn more about AWS Lake Formation from an AWS expert and get an insider look at how you can now build a secure S3 data lake with Presto and AWS Lake Formation.


Webinar Transcript

SPEAKERS

Ali LeClerc | Ahana, Wen Phan | Ahana, Gary Stafford | AWS

Ali LeClerc | Ahana 

All right I think we have folks joining, so thanks everyone for getting here bright and early, if you’re on the west coast, or if you’re on the East Coast your afternoon I guess we will get started here in just a few minutes.

Ali LeClerc | Ahana 

I’ll play some music to get people in the right mindset to learn about Lake Formation and Ahana Cloud for Presto. Wen, do you want to share the title slide of your slide deck are you going to start with something else? Up to you.

Wen Phan | Ahana 

I’ll bring it up in a second.

Ali LeClerc | Ahana 

Alright folks, thanks for joining. We’re going to just wait a few more minutes until we get things kicked off here, just to let people join, so give us a few minutes enjoy the music.

Ali LeClerc | Ahana 

Alright folks, so we’re just waiting a few more minutes letting people get logged in and join and we’ll get started here in just a few.

Ali LeClerc | Ahana 

All right. We are three minutes past the hour. So let’s go ahead and get started. Welcome folks to today’s Ahana webinar “How to Build and Secure AWS S3 Data Lakes with Ahana Cloud and AWS Lake Formation.” I’m Ali LeClerc, and I will be moderating today’s webinar. So before we get started, just a few housekeeping items. One is this session is recorded. So afterwards, you’ll get a link to both the recording and the slides. No need to take copious amounts of notes, you will get both the slides and the recording. Second is we did have an AWS speaker Gary Stafford, who will be joining us, he unfortunately had something come up last minute, but he will be joining as soon as he can finish that up. So you will have an AWS expert join. If you do have questions, please save those. And he will be available to take them later on. Last, like I just mentioned, we are doing Q&A at the end. So there’s a Q&A box, you can just pop your questions into that Q&A box at the bottom of your control panel. And again, we have allotted a bunch of time at the end of this webinar to take those questions. So with that, I want to introduce our speaker Wen Phan. Wen is our principal product manager at Ahana, has been working extensively with the AWS Lake Formation team to build out this integration and is an expert in all things Ahana Cloud and AWS Lake Formation. Before I turn things over to him to get started, I want to share or launch a poll, just to get an idea of the audience that we have on the webinar today. How familiar are you with Presto, with data lakes, and with Lake Formation? So if you could take just a few seconds to fill that in, that would be super appreciated. And we can kind of get a sense of who we have on today’s webinar. Wen is going to kind of tailor things on the fly based on the results here. So looks like good. We have some results coming in. Wen can you see this? Or do I need to end it for you to see it? Can you see any of the results?

Wen Phan | Ahana 

I cannot see. Okay, the results?

Ali LeClerc | Ahana 

No worries. So I’m going to wait we have 41% – 50% participation. I’m going to wait a few more seconds here. And then I will end the poll and show it looks like just to kind of give real time familiarity with Presto, most people 75% very little data lakes, I think it’s more spread across the board. 38% very little 44% have played around 90% using them today. Familiar already with Lake formation. 50% says very little. So it looks like most folks are fairly new to these concepts. And that is great to know. So I’ll just wait maybe a few more seconds here. Looks like we have 64% participation. Going up a little, do 10 more seconds. Give people a minute and a half of this and then I will end the poll here. We’re getting closer, we’re inching up. All righty. Cool. I’m going to end the poll. I’m going to share the results. So everybody can kind of see the audience makeup here. Alrighty. Cool. So with that, Wen, I will turn things over to you.

Wen Phan | Ahana 

Awesome. Thanks, Ali. Thanks, everyone for taking that poll that was very, very useful. Like Ali said, I’m a product manager here at Ahana. I’m really excited to be talking about Ahana Cloud and Lake Formation today. It’s been a project that I’ve been working on for several months now. And excited to have it released. So here’s the agenda, we’ll go through today. Pretty straightforward. We’ll start with some overview of AWS Lake Formation, what it is, then transition to what Ahana is, and then talk about the integration between Ahana Cloud for Presto and AWS Lake Formation. So let’s get into it, AWS Lake Formation. So this is actually an AWS slide. Like Ali mentioned, Gary had something come up, so I’ll go ahead and present it. The bottom line is everybody, and companies, want more value from their data. And what you see here on the screen are some of the trends that we’re seeing in terms of the data growing, coming from multiple sources, being very diverse. Images and tax. It’s being democratized more throughout the organization, and more workloads are using the data. So traditional BI workloads are still there. But you’ll see a lot more machine learning data science type workloads. The paradigm that is emerging to support, this proliferation of data with low-cost storage, as well as allowing for multiple applications to consume it is the data lake essentially.

Today, folks that are building and securing data lakes, it’s taking a while, and this is what AWS is seeing. This is the impetus of why they built AWS Lake Formation. There are three kind of high level components to Lake Formation. The first one is to just streamline the process and make building data lakes a lot faster. So try to compress what used to take months to today’s and providing tooling that can make it easier to move store, update and catalog data. The second piece is the security piece. This is actually the cornerstone of what we’ll be demonstrating and talking about today. But how do you go about, securing –  once you have your data in your data lake, how do you go about securing it? Enforcing policies and authorization model? And although data lake is very centralized, sharing the data across the organization, is very important. So another tenant of AWS Lake Formation is to actually make it quite easy or easier to discover and share your data.

So that’s a high level of Lake Formation. Now, we’ll go into Ahana and kind of why we went and built this and worked with AWS at a early stage to integrate with Lake Formation. So first, for those of you who don’t know, Ahana is the Presto company. And I think there are a few of you who are very new to Presto. So this is a single slide essentially giving a high level overview of what Presto is. Presto is a distributed query engine. It’s not a database, it is a way for us to allow you to access different data sources using ANSI SQL and querying it. The benefit of this distributed query nature is you can scale up and as you need it for the for the data. So that’s really the second point. Presto offers very low latency, a performance that can scale to a lot of large amounts of data. The third piece is Presto was also created in a pluggable architecture for connectors. And what this really translates to, is it supports many data sources. And one prominent use case for Presto, in addition to low latency interactive querying is federated querying or querying across data sources.

The final high-level kind of takeaway for Presto, it is open source, it was originally developed at Meta, aka Facebook, and it’s currently under the auspices of the Linux Foundation. And at the bottom of this slide, here are typical use cases of why organizations go ahead and deploy Presto, given the properties that I’ve kind of mentioned above. Here is a architecture diagram of Presto, I just saw a question it’s MPP. To answer that question.

Ali LeClerc | Ahana 

Can you repeat the question? So everybody knows what it was.

Wen Phan | Ahana 

Yeah, the question is your architecture MPP or SMP? It’s MPP. And this is the way it’s kind of laid out kind of, again, very high level. So, the bottom layer, you have a bunch of sources. And you can see it’s very, very diverse. We have everything from NoSQL type databases to typical relational databases, things in the cloud, streaming, Hadoop. And so Presto is kind of this query layer between your storage, wherever your data is, be able to query it. And at the top layer other consumers of the query engine, whether it be a BI tool, a visualization tool, a notebook. Today, I’ll be using a very simple CLI to access Presto, use a Presto engine to query the data on the data lake across multiple sources and get your results back. So this all sounds amazing. So today, if you were to use Presto and try to stand up Presto yourself, you’re running to potentially run to some of the challenges. And basically, you know, maintaining, managing, spinning up a Presto environment can still be complex today. First of all, it is open source. But if you were to just get the open-source bits, you still have to do a lot of legwork to get the remaining infrastructure to actually start querying. So you still need a catalog. I know some of you are new to data lakes, essentially, you have essentially files in some kind of file store. Before it used to be distributed file systems like HDFS Hadoop, today, the predominant one is S3, which is an object store. So you have a bunch of files, but those files really don’t really mean anything in terms of a query until you have some kind of catalog. So if you were to use Presto, at least the open source version, you still have to go figure out – well, what catalog am I going to use to map those files into some kind of relational entity, mental model for them for you to query? The other one is Presto, has been actually around for quite a while, and it was born of the Hadoop era, it has a ton of configurations. And so if you were to kind of spin this up, you’d have to go figure out what those configurations need to be, going to have to figure out the settings, there’s a lot of complexity there, and, tied to the configuration, you wouldn’t know how to really tune it. What’s good out of the box, might have poor out of the box performance. So all of these challenges, in addition to the proliferation of data lakes, is why Ahana was born and the impetus for our product, which is Ahana Cloud for Presto.

We aim to get you from zero to Presto, in 30 minutes or less. It is a managed cloud service, I will be using it today, you will be able to see it in action. But as a managed cloud service, there is no installation or configuration. We specifically designed this for data teams of all experience levels. In fact, a lot of our customers don’t have huge engineering teams and just really need an easy way of managing this infrastructure and providing the Presto query engine for their data practitioners. Unlike other solutions, we take away most of the complexity, but we still give you enough knobs to tune things, we allow you to select the number of workers, the size of the workers that you want, things like that. And obviously, we have many Presto experts within the company to assist our customers. So that’s just a little bit about Ahana Cloud for Presto, if you want to try it, it’s pretty simple. Just go to our website at that address above, like Ali said, you’ll get this recording, and you can go ahead to that site, and then you can sign up. You will need an AWS account. But if you have one, we can go ahead and provision the infrastructure in your account. And you can get up and running with your first Presto cluster pretty quickly. And a pause here, see if there’s another question.

Ali LeClerc | Ahana 

Looks like we have a few. What format is the RDBMS data stored in S3?

Wen Phan | Ahana 

Yeah, so we just talked about data. I would say the de facto standard, today’s Parquet. You can do any kind of delimited format, CSV, ORC files, things like that. And that then just depends on your reader to go ahead and interpret those files. And again, you have to structure that directory layout with your catalog to properly map those files to a table. And then you’ll have another entity called the database on top of the table. You’ll see some of that, well. I won’t go to that low level, but you’ll see the databases and tables when I show AWS Lake Formation integration.

Ali LeClerc | Ahana 

Great. And then, I just want to take a second, Gary actually was able to join. So welcome Gary. Gary is Solutions Architect at AWS and, obviously, knows a lot about Lake Formation. Great to have you on Gary. And he’s available for questions, if anybody has specific Lake Formation questions, so carry on Wen.

Wen Phan | Ahana 

Hey, Gary, thanks for joining. Okay, so try to really keep it tight. So, just quickly about Lake Formation, since many of you are new to it. And again, there are three pieces – making it easier to stand up the data lake, the security part, and the third part being the sharing. What we’re focused on primarily, with our integration, and you’ll see this, is the security part. How do we use Lake Formation as a centralized source of authorization information, essentially. So what are the benefits? Why did we build this integration? And what is the benefit? So first of all, many folks we’re seeing have invested in AWS as their data lake infrastructure of choice. S3 is huge. And a lot of folks are already using Glue today. Lake Formation leverages both Glue and AWS. So it’s, it’s a, it was a very natural decision for us seeing this particular trend. And so for folks that have already invested put into S3, and Glue, this is a basic native integration for you guys. So this is a picture of how it works. But essentially, you have your files stored in your data lake storage – parquet, CSV, or RC – the data catalog is mapping that to databases and tables, all of that good stuff. And then the thing that we’re going to be applying is Lake Formations access control. So you have these databases, you have these tables. And what we’ll see is can you control it can you control access to which user has access to which table? Actually will be able to see which users have access to which columns and which rows. And so that’s basically, the integration that we’ve built in. So someone – the data lake admin – will go ahead and not only define the schemas but define the access and Ahana for Presto will be able to take advantage of those policies that have been centrally defined.

We make this very easy to use, this is a core principle in our product as well, as I kind of alluded to at the beginning. We’re trying to really reduce complexity and make things easy to use and really democratize this capability. So doing this is very few clicks, and through a very simple UI. So today, if you were going to Ahana, and I’m going to show this with the live the live application, if we show you the screens. Essentially, it’s an extension of Glue, so you would have Glue, we have a single click called “Enabled AWS Lake Formation.” When you go ahead and click that, we make it very easy, we actually provide a CloudFormation template, or stack, that you can run that will go ahead and hook up Ahana, your Ahana Cloud for Presto, to your Lake Formation. And that’s it. The second thing that we do is you’ll notice that we have a bunch of users here. So, you have all these users. And then you can map them to essentially your IAM role, which are what the policies are tied to in Lake Formation. So, in Lake Formation, you’re going to create policies based on these roles. You can say, for example, the data admin can see everything, the HR analyst can only see tables in the HR database, whatever. But you have these users that then will be mapped to these roles. And once we know what that mapping is, when you log into presto, as these users, the policies tied to those roles are enforced in your queries. And I will show this. But again, the point here is we make it easy, right? There’s a simple user interface for you to go ahead and make the mapping. There’s a simple user interface where then for you to go ahead and enable the integration.

Wen Phan | Ahana 

Are we FedRAMP certified in AWS? At this moment, we are we are not. That is inbound requests that we have had, and that we are exploring, depending on, I think, the need. Today, we are not FedRAMP certified. Then the final piece is the fine-grained access control. So, leveraging Lake Formation. I mentioned this, you’re going to build your data lake, you’re going to have databases, you’re going to have tables. And you know, AWS Lake Formation has had database level security and table level security for quite some time we offer that. More recently, they’ve added more fine-grained access control. So not only can you control the database and the table you have access to, but also the columns and the specific roles you have access to. The role level one being just announced, a little over a month ago at the most recent re:Invent. We’re actually one of the earliest partners to go ahead and integrate with this feature that essentially just went GA. I’ll show this. Okay, so that was a lot of talking. I’m going to do a quick time check, we’re good on time. I’m going to pause here. Let me go see before I go into the demo, let me see what we have for other questions. Okay, great. I answered the FedRAMP one.

Ali LeClerc | Ahana 

Here’s one that came in – Can Presto integrate with AzureAD AWS SSO / AWS SSO for user management and SSO?

Wen Phan | Ahana 

Okay, so the specific AD question, I don’t know the answer to that. This is probably a two to level question. So, there’s, there’s Presto. Just native Presto that you get out of the box and how you can authenticate to that. And then there is the Ahana managed service. What I can say is single sign-on has been a request and we are working on providing more single sign on capabilities through our managed service. For the open-source Presto itself, I am not aware of any direct capability to AzureAD kind of integration there. If you are interested, I can definitely follow up with a more thorough answer. I think we have who asked that, if you actually are interested, feel free to email us and we can reach out to you.

Ali LeClerc | Ahana 

Thanks, Wen.

Wen Phan | Ahana 

Okay, so we’re going to do the demo. Before I get to the nitty gritty of demo. Let me give you some kind of overview and texture. So let me just orient you, everyone, to the application first, let’s go ahead and do that. So many of you are new to go move this new to Ahana. Once you have Ahana installed, this is what the UI looks like. It’s pretty simple, right? You can go ahead and create a cluster, you can name your cluster, whatever, [example] Acme. We have a few settings, and how large you want your instances, what kind of auto scaling you want. Like we mentioned out of the box, if you need a catalog, we can provide a catalog for you. You can create users so that users can log into this cluster, we have available ones here, you can always create a new one. Step one is create your cluster. And then we’ve separated the notion of a cluster from a data source. That way, you can have multiple clusters and reuse configuration that you have with your data source. For example, if you go to a data source, I could go ahead and create a glue data source. And as you select different data sources, you provide the configuration information specific to that data source. In my case, I’ll do a Lake Formation one. So, I’m going to Lake formation, you’ll select what region your Lake Formation services in. You can use Vanilla Glue as well, you don’t have to use Lake Formation, if you don’t want to use the fine-grained access control. If you want to, and you want to use your policies, you enable Lake Formation, and then you go ahead and run the CloudFormation script stack. And they’ll go ahead and do the integration for you. If you want to do it yourself, or you’re very capable, we do provide information about that in our documentation. So again, we try to make things easy, but we also try to be very transparent. If you want more control on you on your own. But that’s it. And then you can map the roles, as I mentioned before, and then you go ahead and add the data source. And it will go ahead and create the data source. In the interest of time, I’ve already done this.

You can see I have a bunch of data sources, I have some Netflix data on Postgres, it’s not really, real data, it’s just, it’s what we call it. We have another data source for MySQL, I have Vanilla Glue, and I have a Lake Formation one. I have a single cluster right now that’s been idle for some time for two hours called “Analysts.” Once it’s up, you can see by default has three workers. It’s scaled down to one, not a really big deal, because these queries I’m going to run aren’t going to be very, very large. This is actually a time saving feature. But once it’s up, you can connect it you’ll have an endpoint. And whatever tool you want, can connect via JDBC, or the endpoint, we have Superset built in. I’m going to go ahead and use a CLI. But that was just a high-level overview of the product, since folks probably are new to it. But pretty simple. The final thing is you can create your users. And you can see how many clusters your users are attached to. All right, so let’s go back to slideware for a minute and set the stage for what you’re going to see. We’re going to query some data, and we’re going to see the policies in Lake Formation in action.

I’ve set up some data so we can have a scenario that we can kind of follow along and see the various capabilities, the various fine grained access control in Lake Formation. So imagine we work for a company, we have different departments, sales department and HR department. And so let’s say the sales department has their own database. And inside there, they have transactions data, about sales transactions, you have information on the customers, and we have another database for human resources or HR to have employees. So here’s a here’s a sample of what the transaction data could look like. You have your timestamp, you have some customer ID, you have credit card number, you have perhaps the category by which that transaction was meant and you have whatever the amount for that transaction was. Customer data, you have the customer ID, which is just a primary key the ID – first name last name, gender, date of birth, where they live – again fictitious data, but will represent kind of representative of maybe some use cases that you’ll run into. And then HR, instead of customers, pretend you have another table with just your employees. Okay? All right. So let’s say I am the admin, and my name is Annie. And I want to log in, and I’m an admin. I should have access to everything, let’s go ahead and try this. So again, my my cluster is already up, I have the endpoints.

Wen Phan | Ahana 

I’m going to log in as Annie, and let’s take a look at what we see. Presto has different terminology, and it might seem a little confusing, so I’ll decode it for those of you who aren’t familiar. Each connector essentially becomes what is called a catalog. Now, this is very different from a data catalog that we talk about – it’s just what Presto calls it. In my case, the Lake Formation data source that I created is called LF, for Lake Formation. I also called it LF because I didn’t want to type as much. Just to tie this back to what you’re seeing: if we go back here, you notice that the data source is called LF, and I’ve attached it to this analyst cluster that I created. That’s why you see the catalog name as LF. And LF is attached to Lake Formation, which has native integration with Glue and S3. If I look at what databases – they’re called schemas in Presto – I have in LF, I should see the databases I just showed you. So you see them: ignore the information_schema, that’s just metadata, and you see sales and you see HR. I can also take a look at what tables I have in the sales database, and I have customers and transactions. And since I’m an admin, I should be able to see everything in the transactions table, for example – I’ve set this policy already in Lake Formation. So I go here, and I should see all the tables, the same data that I showed you in the PowerPoint. You see the transaction timestamp, the customer ID, the credit card number, the category, etc. So great, I’m an admin, I can do stuff.

Let me check some questions. “What do you type to invoke Presto?” Okay, let’s be very specific for this question. Presto is already up – I’ve already provisioned this cluster through Ahana. So when I went and said “create cluster,” that spun up all the nodes of the Presto cluster, set it up, and configured the coordinator, all of that behind the scenes. That’s Presto: it’s a distributed query engine cluster. Presto then exposes endpoints – [inaudible] endpoint, a JDBC endpoint – that you can attach a client to. You can have multiple clients; most BI tools will be able to access this.

In this case, for the simplicity of this demo, I just use a CLI. So I would download the CLI, which is just another Java utility – so you need to have Java installed – and then run the CLI with some parameters. I have the CLI, which is actually called presto – that’s what the binary is – then I pass it some parameters. I say, okay, what’s the server? Here’s the endpoint. So it’s connecting from my local desktop to that cluster in AWS. But you can’t just access it; you need to provide some credentials. So I’m saying I’m going to authenticate myself with a password.

The user I want to access that cluster with is “Annie.” Why is this valid? Well, it’s valid because when I created this cluster, I specified which users are available in that cluster. So I have Annie, Harry, Olivia, Oscar, Sally, and Wally. So again, just to summarize: I didn’t invoke Presto from my desktop, my local machine. I’m just using a client – in this case, the Presto CLI – to connect to the cluster that I provisioned via Ahana Cloud, in the cloud, and I’m just accessing that. As part of that, the cluster is already configured to use Lake Formation. The first thing I did was log in as Annie, and as we mentioned, Annie is my admin. As an admin, she can access everything – she has access to all the databases, all the tables, etc.
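For anyone who would rather script this than use the CLI, here is a minimal sketch of the same connection using the presto-python-client library. The endpoint, port, and password are placeholders, and the catalog and schema names simply mirror the demo (lf, sales); adjust them for your own cluster.

```python
# Minimal sketch: connect to an Ahana-provisioned Presto cluster the same way
# the CLI does (endpoint + user + password), then explore catalogs and schemas.
# The endpoint, port, and password below are placeholders, not real values.
import prestodb

conn = prestodb.dbapi.connect(
    host="<your-cluster-endpoint>",      # the endpoint shown in the Ahana console
    port=443,
    http_scheme="https",
    user="annie",                        # a user defined on the cluster
    auth=prestodb.auth.BasicAuthentication("annie", "<password>"),
    catalog="lf",                        # the Lake Formation data source from the demo
    schema="sales",
)
cur = conn.cursor()

# Presto terminology: a connector becomes a catalog, and databases are schemas.
cur.execute("SHOW SCHEMAS FROM lf")
print(cur.fetchall())                    # e.g. information_schema, sales, hr for the admin

cur.execute("SHOW TABLES FROM lf.sales")
print(cur.fetchall())                    # customers, transactions

cur.execute("SELECT * FROM lf.sales.transactions LIMIT 10")
print(cur.fetchall())                    # an admin like Annie sees every column
```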

Wen Phan | Ahana 

Okay, so let’s do another, more interesting case. Let’s say that instead of Annie, I log in as Sally, who is a sales analyst. As a platform owner, I know that in order to do her job, all Sally needs to look at are transactions – say she’s going to forecast what the sales are, or do some analysis on which types of transactions have been good. If we go back and look at the transactions table, this is what it looks like. Now, when I do this, I notice there’s a credit card number, and I don’t really want to expose a credit card number to my analysts, because they don’t need it for their work. So in this policy for financial information I’m going to say that any sales analyst – in this case, Sally – can only have access to the transactions table, and when she accesses the transactions table, she will not be able to see the credit card number. Okay, so let’s go see what this looks like. Instead of Annie, I’m going to log in as Sally. Let’s just see what we did here: if we look at the data source, Annie got mapped to the role of data admin, so she can see everything. Sally is mapped to the role of “sales analyst,” and therefore can only do what a sales analyst is defined to do in Lake Formation. The magic is that it’s defined in Lake Formation, but Ahana Cloud for Presto can take advantage of that policy.

So I’m going to go ahead and log in as Sally. Let’s first take a look at the databases I can see – they’re called schemas in LF. The first thing you’ll notice is that Sally does not see HR, because she doesn’t need to; she has been restricted so she can only see sales. Now let’s see what tables Sally can see. Sally can only see transactions; she cannot actually see the customers table. But she doesn’t know this – she’s just running queries, and she’s saying, “Well, this is all I can see, and it’s what I need to do my job, so I’m okay with it.” So let’s try to query the transactions table now – LF, sales, transactions. When I try to do this, I actually get an Access Denied. Why? The reason I get an Access Denied here is that I cannot look at all the columns – I’ve been restricted to only a subset of them. As I mentioned, we are not able to see the credit card number, so when I try to do a select star, I can’t, because the star includes the credit card number. We are making an improvement where we won’t do an explicit deny and will just return the columns you have access to – otherwise this can be a little annoying. But at the end of the day, you can see the authorization being enforced: you have Presto, and the policies that are set in Lake Formation are being enforced.

So now, instead of doing a star, I specifically paste the columns I have access to – and I can see the data and do whatever I need to do. I can do a group by to see which categories are doing well, I can do a time-series analysis on revenue and then do a forecast for the next three months – whatever I need to do as a sales analyst. So that’s great. Okay, I’m going to go ahead and log out. Let’s go back to this. We know Sally’s world. Now let’s say the marketing manager, Ali here, has two marketing analysts, and she’s made them responsible for different regions, and we want to get some demographics on our customers. So we have Wally. If you look at the customers data, there’s a bunch of PII – first name, last name, date of birth. So a couple of things: we can say, you know what, they don’t need to see this PII, so we’re going to go ahead and mask it with Lake Formation. And like I mentioned, Ali has segmented her analysts to cover different regions across the Pacific West Coast. Wally is really responsible only for Washington, so we decided to say, hey, on a need-to-know basis, you’re only going to get rows back from customers that live in Washington. All right, so let’s go ahead and do that.
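Before moving on to Wally, here is a rough sketch of Sally’s experience in code form, again using the Python client. The endpoint and password are placeholders, and the column names (custid, category, amount) are stand-ins for the demo’s sample schema rather than the exact names used in the webinar.

```python
# Sketch of the sales-analyst experience. Endpoint/password are placeholders;
# column names follow the demo's sample transactions table and are illustrative.
import prestodb

conn = prestodb.dbapi.connect(
    host="<your-cluster-endpoint>", port=443, http_scheme="https",
    user="sally",
    auth=prestodb.auth.BasicAuthentication("sally", "<password>"),
    catalog="lf", schema="sales",
)
cur = conn.cursor()

# A plain SELECT * is denied: the policy hides the credit card column, and the
# current behavior is an explicit deny rather than silently dropping columns.
# cur.execute("SELECT * FROM transactions")   # -> Access Denied

# Naming only the permitted columns works.
cur.execute("SELECT custid, category, amount FROM transactions LIMIT 10")
print(cur.fetchall())

# Aggregations over permitted columns work too, e.g. sales by category.
cur.execute("""
    SELECT category, sum(amount) AS total_sales
    FROM transactions
    GROUP BY category
    ORDER BY total_sales DESC
""")
print(cur.fetchall())
```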

Wen Phan | Ahana 

I’m going to log in as Wally, and let’s see the databases again, just to see it – I’m showing you the different layers of the authorization. So Wally can see sales, not HR. Now let’s see what tables Wally can see. Wally should only see customers – and indeed he can only see customers; he cannot see the transactions table, because he’s been restricted from it. Let’s try the same thing again: select star from sales.customers. And we expect this to error out. Why? Because again, it’s PII data – we cannot do a star. We don’t allow first name, last name, date of birth, all of that. If I go ahead and take those columns out, I’ll see the columns that I want, and I only see the rows that come from Washington. I technically did not have to select the state; I just want to prove that I’m only getting records from Washington.

Let’s try another analyst, Olivia. Olivia is responsible only for Oregon – she’s basically a peer to Wally, but she covers Oregon. So I’m going to run the same query, which I’ve saved, and see what happens. In this case, Olivia can only see Oregon. What you’re seeing here is the fine-grained access control: you’re seeing database restriction, table-level restriction, column-level restriction, and row-level restriction. And you can do as many of these as you want. So we talked about Wally, and we know Olivia can only see Oregon. One more persona – actually two more personas – and then we’re done. I think you all get the point.
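The row-level piece is driven entirely by which user you connect as: the same query returns different rows for Wally and Olivia. Here is a sketch under the same assumptions as above (placeholder endpoint and credentials, illustrative column names).

```python
# Rough sketch: the same SELECT returns different rows depending on the user,
# because Lake Formation applies the row filter tied to that user's role.
import prestodb

def connect_as(user, password):
    # Placeholder endpoint/credentials; mirrors the earlier connection sketch.
    return prestodb.dbapi.connect(
        host="<your-cluster-endpoint>", port=443, http_scheme="https",
        user=user,
        auth=prestodb.auth.BasicAuthentication(user, password),
        catalog="lf", schema="sales",
    )

def customer_sample(user, password):
    cur = connect_as(user, password).cursor()
    # Illustrative column names for the demo's customers table.
    cur.execute("SELECT custid, gender, state FROM customers LIMIT 10")
    return cur.fetchall()

# The same query, two users, two different row sets:
print(customer_sample("wally", "<password>"))    # rows filtered to Washington
print(customer_sample("olivia", "<password>"))   # rows filtered to Oregon
```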

I think I’ve provided sufficient proof that we can, in fact, enforce policies. The last one is Harry, who’s in HR. If I log in as Harry, he should only be able to see the HR data set. So I go in as Harry and show the tables. First, just to be complete, I’ll try the sales data: can Harry see the transactions? He couldn’t. And then, since I already know what the schema is, I can go ahead and look at all the employees in this database. I’ll see everything, because I’m in HR, so I can see personal information and it doesn’t restrict me.

Okay. And the final thing is: what happens if I have a user that I haven’t mapped any policies to? I actually have one user here, Oscar, and I didn’t give Oscar any policies whatsoever. So let me go ahead here. Notice that Oscar is in the cluster, but he is not mapped to any role whatsoever. I go back to my cluster – Oscar is here, so he is a valid user in the cluster, but he has no role. By default, if you have no role, we deny you access. That’s essentially what’s going to happen, but just to prove it: Oscar is using this cluster, and if I show catalogs, you’ll see LF. But if I try to see what’s in LF – what’s in that connector – Access Denied. Because there is no mapping, you can’t see anything; we’re not going to tell you anything. We’re not going to tell you what databases are in there – no tables, nothing. So in that case it’s very secure: if you don’t have explicit access, you don’t get any information. Okay, I’ve been in demo mode for a while, so I just wanted to check if there are any questions in the chat. All right, none.

So let’s just do a summary of what we saw and then wrap it up for Q&A – we’re good on time, actually – and I’ll give you some information on where you can learn more if you want to dig in deep.

So first, the review. We had all these users, you see the roles, and we saw a case where you have all access and a case where you have no access. I did a bunch of other demos where you saw varying degrees of access: table, database, column, row, all of that. And that’s what this integration really brings to folks that have a data lake today. You’ve got all your data there. Inside your data lake, you’ve decided that Presto is the way to go in terms of interactive querying, because it scales and it can handle all your data. Now you want to roll it out to all your analysts and data practitioners, but you want to do it in a secure way, you want to enforce it, and you want to do it in one place. And Lake Formation doesn’t only integrate with Ahana; it can integrate with other tools within the AWS ecosystem. So you’re defining these policies in one place, and Ahana-managed Presto clusters can take advantage of that.

There was a more technical talk on this that we just presented at PrestoCon, if you’re interested in some of the technical details, with my colleague Jalpreet, who is the main engineer on this, as well as a representative from AWS, Roy. If you’re interested, go ahead and Google it and go to YouTube – you can watch it, and it will give you more of the nitty gritty under the hood. And that is all I have for planned content.

Ali LeClerc | Ahana 

Wen, what a fantastic demo – thanks for going through all of those. Fantastic. I wanted to give Gary a chance to share his perspective on the integration and his thoughts on what this means from the AWS point of view. So Gary, if you don’t mind turning on your video, that would be awesome – say hi to everyone and share your thoughts.

Gary Stafford | AWS 

That’s much better than that corporate picture that was up there. Yeah, thank you. I would also recommend, as Wen said, viewing the PrestoCon video with Roy and Jalpreet – I think they go into a lot of detail on how the integration works under the covers. I’ll also share two links, Ali; I’ll paste them in there. One link covers what’s new with AWS Lake Formation – Roy mentioned some of the new features that were announced, and it’s a very actively developed project with a lot of new features coming out, so I’ll share that link so folks know what’s new. Also, Jalpreet mentioned a lot of the API features; Lake Formation has a number of APIs, so I’ll drop a link in there too that covers some of those available endpoints and APIs a little better. I’ll also share my personal vision. I think of services like Amazon EventBridge, which has a partner integration that makes it very easy for SaaS partners to integrate with customers on the AWS platform. I think it would be phenomenal at some point if Lake Formation progresses to that point with some of the features that Roy mentioned and Wen demonstrated today – where partners like Ahana could integrate with Lake Formation and get an out-of-the-box data lake: a way to create a data lake, a way to secure a data lake, and simply add their analytics engine with their special sauce on top of that, without having to do that heavy lifting. I hope that’s the direction Lake Formation is headed in; I think it would be phenomenal to have a better integration story with our partners on AWS.

Ali LeClerc | Ahana 

Fantastic. Thanks, Gary. With that, we have a few questions that have come in. Again, if you have a question, you can pop it into the Q&A box or even the chat box. So Wen, I think this one’s for you: can you share a little bit more detail about what happens when you enable the integration?

Wen Phan | Ahana

Sure, I will answer this question in two ways. I’ll show you what we’re doing under the hood – this API exchange – and this is a recent release. So let me go ahead and share my screen again. And whoever asked the question, if I don’t answer it, let me know. When you go to the data source, like I mentioned, it’s pretty simple, and we make it that way on purpose. When you enable Lake Formation, you can go ahead and launch this CloudFormation template, which will do the integration. What is it actually doing under the hood? First of all, this is a good time for me to introduce our documentation. If you go to ahana.io, all of this is documented: go to docs, and since Lake Formation is tightly coupled with Glue, go to manage data sources, then Glue, and this will walk you through it. There’s a section here for when you don’t want to use the CloudFormation template, or you simply want to understand what it’s really doing – you can go ahead and read about it. Essentially, like Roy mentioned, there are a bunch of APIs, and one of them is the data lake settings API in Lake Formation. If you use the AWS CLI, you can actually see this and you’ll get a response. What we’re doing is setting a bunch of flags: you have to allow Ahana Presto to do the filtering on your behalf – we’re going to get the data, look at the policies, and block out anything you’re not supposed to see. We are also a partner, so the service needs to know that this is a valid partner interacting with the Lake Formation service. That’s all this is doing. You could do it all manually with the CLI if you really wanted to; we just take care of it for you, on your behalf. So that’s what’s going on to enable the integration. The second part – and again, this goes into a lot more detail in that talk – is what’s actually happening under the hood. I’m just going to show a quick slide for this. Essentially, you define everything in AWS, and when you make a query, our service – in our case, we’re a third-party application – goes and talks to Lake Formation. You set this up, we talk to Lake Formation, we get temporary credentials, and then we know what the policies are and are able to access only the data you’re allowed to see. Then we process it with the query, and you see the result in the client – in my case, that’s what you saw in the CLI.
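If you want to see those settings for yourself, here is a rough sketch of inspecting them with boto3. Which specific fields the Ahana CloudFormation stack changes is an assumption on my part; the Ahana documentation and the PrestoCon talk referenced above are the authoritative sources.

```python
# Rough sketch: inspect the Lake Formation data lake settings that govern
# whether a third-party engine is allowed to filter data on your behalf.
# Exactly which fields the Ahana CloudFormation stack sets is an assumption
# here; consult the Ahana documentation for the authoritative list.
import boto3

lf = boto3.client("lakeformation")
settings = lf.get_data_lake_settings()["DataLakeSettings"]

# Third-party engines generally need external data filtering enabled and
# typically appear in an allow list of authorized accounts/partners.
print(settings.get("AllowExternalDataFiltering"))
print(settings.get("ExternalDataFilteringAllowList"))
print(settings.get("AuthorizedSessionTagValueList"))
```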

Ali LeClerc | Ahana 

Cool, thanks Wen, thorough answer. The next question that came in is: is this product a competitor to Redshift? I’m assuming when you say product, you mean Ahana? But maybe you can talk about both Ahana and Presto, Wen.

Wen Phan | Ahana 

Yeah, I mean, it all comes down to your use case. Redshift is more like a data warehouse, and that’s great – it has its own use cases. And again, Presto can connect to Redshift. So it depends on what you want. Presto can talk to the data lake, so if you have use cases that make more sense on a data lake, Presto is one way to access it. And if you have use cases that need to span both the data lake and Redshift, Presto can federate that query as well. So it’s just another piece in the ecosystem. I don’t necessarily think it’s a competitor; as with many things, it’s about what your use case is – pick the right tool for your use case.

Ali LeClerc | Ahana 

Great. I think you just mentioned something around Glue, Wen. So somebody asked: do I need to use Glue for my catalog if I’m using Lake Formation with Ahana Cloud?

Wen Phan | Ahana 

Yes, you do. It’s a tightly coupled AWS stack, which works very well, so you do have to use Glue.

Ali LeClerc | Ahana 

All right. I think we’ve answered a ton of questions along the way, as well as just now. If there are no more – and it looks like no more have come in – then I think we can wrap up here. Any last parting thoughts, Wen and Gary, before we say goodbye to everybody? On that note, I’m going to post our link in here. I don’t know if Wen mentioned it, maybe he did: we have a 14-day free trial. No commitment – you can check out Ahana Cloud for Presto on AWS free for 14 days, play around with it, and get started with Lake Formation. If you’re interested in learning more, we’ll make sure to put you in touch with Wen, who again is the local expert on this at Ahana, and Gary, of course, is always able to help as well. So feel free to check out our 14-day free trial. And with that, I think that’s it. All right, everyone. Thanks, Wen – fantastic demo, fantastic presentation, appreciate it. Gary, thanks for being available; appreciate all of your support in getting this integration off the ground and into the hands of our customers. Thanks, everybody, for joining and sticking with us till the end. You’ll get a link to the recording and the slides, and we’ll see you next time.

Speakers

Gary Stafford

Solutions Architect, AWS


Wen Phan

Principal Product Manager, Ahana


Ahana Responds to Growing Demand for its SaaS for Presto on AWS with Appointment of Chief Revenue Officer

Enterprise Sales Exec Andy Sacks Appointed to Strengthen Go-to-Market Team

San Mateo, Calif. – January 11, 2022 Ahana, the only SaaS for Presto, today announced the appointment of Andy Sacks as Chief Revenue Officer, reporting to Cofounder and CEO Steven Mih. In this role, Andy will lead Ahana Cloud’s global revenue strategy. With over 20 years of enterprise experience, Andy brings expertise in developing significant direct and indirect routes to market across both pre and post sales organizations.

Ahana Cloud for Presto is a cloud-native managed service for AWS that gives customers complete control, better price-performance, and total visibility of Presto clusters and their connected data sources. “We’ve seen rapidly growing demand for our Presto managed service offering which brings SQL to AWS S3, allowing for interactive, ad hoc analytics on the data lake,” said Mih. “As the next step, we are committed to building a world-class Go-To-Market team with Andy at the helm to run the sales organization. His strong background building enterprise sales organizations, as well as his deep experience in the open source space, makes him the ideal choice.”

“I am excited to join Ahana, the only company that is simplifying open data lake analytics with the easiest SaaS for Presto, enabling data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources,” said Sacks. “I am looking forward to leveraging my experiences to help drive Ahana’s growth through innovative Presto use cases for customers without the complexities of managing cloud deployments.”

Prior to Ahana, Andy spent several years as an Executive Vice President of Sales. Most recently at Alloy Technologies, and prior to that at Imply Data and GridGain Systems, he developed and led each company’s global Sales organization, while posting triple digit growth year over year. At both Imply and GridGain, he created sales organizations from scratch. Prior to GridGain, he spent over six years at Red Hat, where he joined as part of the JBoss acquisition. There he developed and led strategic sales teams while delivering substantial revenue to the company. Prior to Red Hat, he held sales leadership roles at Bluestone Software (acquired by HP), RightWorks (acquired by i2) and Inktomi (acquired by Yahoo! and Verity), where he was instrumental in developing the company’s Partner Sales organization. Andy holds a Bachelor of Science degree in Computer Science from California State University, Sacramento.

Supporting Resources

Download a head shot of Andy Sacks https://ahana.io/wp-content/uploads/2022/01/Andy-Sacks.jpg 

Tweet this:  @AhanaIO bolsters Go-To-Market team adding Chief Revenue Officer Andy Sacks #CRO #newhire #executiveteam https://bit.ly/3zJCvBL

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana Cofounders Make Data Predictions for 2022

Open Data Lake Analytics Stack, Open Source, Data Engineering and More SaaS and Containers Top the List

San Mateo, Calif. – January 5, 2022 Ahana’s Cofounder and Chief Product Officer, Dipti Borkar, and Cofounder and Chief Executive Officer, Steven Mih predict major developments in cloud, data analytics, databases and data warehousing in 2022. 

The COVID-19 pandemic continues to propel businesses to make strategic data-driven shifts. Today more companies are augmenting the traditional cloud data warehouse with cloud data lakes for much greater flexibility and affordability. Combined with more Analytics and AI applications, powerful, cloud-native open source technologies are empowering data platform teams to analyze that data faster, easier and more cost-effectively in SaaS environments. 

Dipti Borkar, Co-founder and Chief Product Officer, outlines the major trends she sees on the horizon in 2022:

  • OpenFlake – the Open Data Lake for Warehouse Workloads: Data warehouses like Snowflake are the new Teradata with proprietary formats. 2022 will be about the Open Data Lake Analytics stack that allows for open formats, open source, open cloud and no vendor lock-in.
  • More Open Source Behind Analytics & AI – As the momentum behind the Open Data Lake Analytics stack to power Analytics & AI applications continues to grow, we’ll see a bigger focus on leveraging Open Source to address flexibility and cost limitations from traditional enterprise data warehouses. Open source cloud-native technologies like Presto, Apache Spark, Superset, and Hudi will power AI platforms at a larger scale, opening up new use cases and workloads.
  • Database Engineering is Cool Again – With the rise of the Data Lake tide, 2022 will make database engineering cool again. The database benchmarking wars will be back and the database engineers who can build a data lake stack with data warehousing capabilities (transactions, security) but without the compromises (lock-in, cost) will win. 
  • A Post-Pandemic Data-Driven Strategic Shift to Out-Of-The-Box Solutions – The pandemic has brought about massive change across every industry and the successful “pandemic” companies were able to pivot from their traditional business model. In 2022 we’ll see less time spent on managing complex, distributed systems and more time focused on delivering business-driven innovation. That means more out-of-the-box cloud solution providers that reduce cloud complexities so companies can focus on delivering value to their customers.
  • More SaaS, More Containers – When it comes to 2022, abstracting the complexities of infrastructure will be the name of the game. Containers provide scalability, portability, extensibility and availability advantages, and technologies like Kubernetes alleviate the pain around building, delivering, and scaling containerized apps. As the SaaS space continues to explode, we’ll see even more innovation in the container space. 

Steven Mih, Co-founder and Chief Executive Officer, outlines a major trend he sees on the horizon in 2022:

  • Investment & Adoption of Managed Services for Open Source Will Soar – More companies will adopt managed services for open source in 2022 as more cloud-native open source technologies become mainstream (Spark, Kafka, Presto, Hudi, Superset). Open source companies offering easier-to-use, managed service versions of installed software enable companies to take advantage of these powerful systems without the resource overhead so they can focus on business-driven innovation.

Tweet this: @AhanaIO announces 2022 #Data Predictions #cloud #opensource #analytics https://bit.ly/3pT0KtZ

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io


Ahana and Presto Praised for Technology Innovation and Leadership in Open Source, Big Data and Data Analytics with Recent Industry Awards

San Mateo, Calif. – December 15, 2021 Ahana, the only SaaS for Presto, today announced the addition of many new industry accolades in 2H 2021. Presto, originally created by Meta (Facebook) who open sourced and donated the project to Linux Foundation’s Presto Foundation, is the SQL query engine for the data lake. Ahana Cloud for Presto is the only SaaS for Presto on AWS, a cloud-native managed service that gives customers complete control and visibility of Presto clusters and their data. 

Recent award recognitions, include:

  • 2021 BIG Awards for Business, “Start-up of the Year” –  Ahana is recognized by the Business Intelligence Group as a winner of the 2021 BIG Awards for Business Program in the Start-up of the Year category as a company leading its respective industry.
  • CRN, “Emerging Vendors for 2021” – As part of CRN’s Emerging Vendors for 2021, here are 17 hot big data startups, founded in 2015 or later, that solution providers should be aware of. Ahana is listed for its cloud-native managed service for the Presto distributed SQL query engine for Amazon Web Services.
  • CRN, “2021 Tech Innovator Awards” – From among 373 applicants, CRN staff selected products spanning the IT industry—including in cloud, infrastructure, security, software and devices—that offer both strong differentiation and major partner opportunities. Ahana Cloud for Presto was named a finalist in the Big Data category. 
  • DBTA, “Trend Setting Products in Data and Information Management for 2022” – These products, platforms and services range from long-established offerings that are evolving to meet the needs of their loyal constituents to breakthrough technologies that may only be in the early stages of adoption. However, the common element for all is that they represent a commitment to innovation and seek to provide organizations with tools to address changing market requirements. Ahana is included in this list of most significant products. 
  • Infoworld, “The Best Open Source Software of 2021” – InfoWorld’s 2021 Bossie Awards recognize the year’s best open source software for software development, devops, data analytics, and machine learning. Presto, an open source, distributed SQL engine for online analytical processing that runs in clusters, is recognized with a prestigious Bossie award this year. The Presto Foundation oversees the development of Presto. Meta, Uber, Twitter, and Alibaba founded the Presto Foundation and Ahana is a member.  
  • InsideBIGDATA, “IMPACT 50 List for Q3 and Q4 2021” – Ahana earned an Honorable Mention for both of the last two quarters of the year as one of the most important movers and shakers in the big data industry. Companies on the list have proven their relevance by the way they’re impacting the enterprise through leading edge products and services. 
  • Solutions Review, “Coolest Data Analytics and Business Intelligence CEOs of 2021” – This list of the coolest data analytics CEOs which includes Ahana’s Cofounder and CEO Steven Mih is based on a number of factors, including the company’s market share, growth trajectory, and the impact each individual has had on its presence in what is becoming the most competitive global software market. One thing that stands out is the diversity of skills that these chief executives bring to the table, each with a unique perspective that allows their company to thrive. 
  • Solutions Review, “6 Data Analytics and BI Vendors to Watch in 2022” – This list is an annual listing of solution providers Solutions Review believes are worth monitoring, which includes Ahana. Companies are commonly included if they demonstrate a product roadmap aligning with Solutions Review’s meta-analysis of the marketplace. Other criteria include recent and significant funding, talent acquisition, a disruptive or innovative new technology or product, or inclusion in a major analyst publication.

“We are proud that Ahana’s managed service for Presto has been recognized by top industry publications as a solution that is simplifying open data lake analytics with the easiest SaaS for Presto, enabling data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources,” said Steven Mih, cofounder and CEO, Ahana. “In less than a year, Ahana’s innovation has been proven with innovative use cases delivering interactive, ad-hoc analytics with Presto without having to worry about the complexities of managing cloud deployments.”

Tweet this:  @AhanaIO praised for technology innovation and leadership with new industry #awards @CRN @DBTA @BigDataQtrly @insideBigData @Infoworld @SolutionsReview #Presto #OpenSource #Analytics #Cloud #DataManagement https://bit.ly/3ESDnWy 

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana Cloud for Presto Delivers Deep Integration with AWS Lake Formation Through Participation in Launch Program

Integration enables data platform teams to seamlessly integrate Presto with their existing AWS data services while providing granular security for data

San Mateo, Calif. – December 9, 2021 Ahana, the only SaaS for Presto, today announced Ahana Cloud for Presto’s deep integration with AWS Lake Formation, an Amazon Web Services, Inc. (AWS) service that makes it easy to set up a secure data lake, manage security, and provide self-service access to data with Amazon Simple Storage Service (Amazon S3). As an early partner in the launch program, this integration allows data platform teams to quickly set up a secure data lake and run ad hoc analytics on that data lake with Presto, the de facto SQL query engine for data lakes.

Amazon S3 has quickly become the de facto storage for the cloud, widely used as a data lake. As more data is stored in the data lake, query engines like Presto can directly query the data lake for analytics, opening up a broader set of Structured Query Language (SQL) use cases including reporting and dashboarding, data science, and more. Security of all this data is paramount because unlike databases, data lakes do not have built-in security and the same data can be used across multiple compute engines and technologies. This is what AWS Lake Formation solves for.

AWS Lake Formation enables users to set up a secure data lake in days. It simplifies the security on the data lake, allowing users to centrally define security, governance, and auditing policies in one place, reducing the effort in configuring policies across services and providing consistent enforcement and compliance. With this integration, AWS users can integrate Presto natively with AWS Glue, AWS Lake Formation and Amazon S3, seamlessly bringing Presto to their existing AWS stack. In addition to Presto, data platform teams will get unified governance on the data lake for many other compute engines like Apache Spark and ETL-focused managed services in addition to the already supported AWS native services like Amazon Redshift and Amazon EMR.

“We are thrilled to announce our work with AWS Lake Formation, allowing AWS Lake Formation users seamless access to Presto on their data lake,” said Dipti Borkar, Cofounder and Chief Product Officer at Ahana. “Ahana Cloud for Presto coupled with AWS Lake Formation gives customers the ability to stand up a fully secure data lake with Presto on top in a matter of hours, decreasing time to value without compromising security for today’s data platform team. We look forward to opening up even more use cases on the secure data lake with Ahana Cloud for Presto and AWS Lake Formation.”

The Ahana Cloud and AWS Lake Formation integration has already opened up new use cases for customers. One use case centers around making Presto accessible to internal data practitioners like data engineers and data scientists, who can then in turn develop downstream artifacts (e.g. models, dashboards). Another use case is exposing the data platform to external clients, which is how Ahana customer Metropolis is leveraging the integration. In Metropolis’ case, they can provide their external customers transparency into internal operational data and metrics, enabling them to provide an exceptional customer experience.

“Our business relies on providing analytics across a range of data sources for our clients, so it’s critical that we provide both a transparent and secure experience for them,” said Ameer Elkordy, Lead Data Engineer at Metropolis. “We use Amazon S3 as our data lake and Ahana Cloud for Presto for ad hoc queries on that data lake. Now, with the Ahana and AWS Lake Formation integration, we get even more granular security with data access control that’s easy to configure and native to our AWS stack. This allows us to scale analytics out to our teams without worrying about security concerns.”

Ahana Cloud for Presto on AWS Lake Formation is available today. You can learn more and get started at https://ahana.io/aws-lake-formation

Supporting Resources:

TWEET THIS: @Ahana Cloud for #Presto delivers deep integration with AWS Lake Formation  #OpenSource #Analytics #Cloud https://bit.ly/3Ix9L35

About Ahana

Ahana, the only SaaS for Presto, offers a managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Announcing the Ahana Cloud for Presto integration with AWS Lake Formation

We’re excited to announce that Ahana Cloud for Presto now integrates with AWS Lake Formation, including support for the recent general availability of row-level security.

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Customers can manage permissions to data in a single place, making it easier to enforce security across a wide range of tools and services. Over the past several months we’ve worked closely with the AWS Lake Formation team to bring Lake Formation capabilities to Presto on AWS.  Further, we’re grateful to our customers who were willing to preview early versions of our integration.

Today, Ahana Cloud for Presto allows customers to use Presto to query their data protected with AWS Lake Formation fine-grained permissions with a few clicks.  Our customers can bring Presto to their existing AWS stack and scale their data teams without compromising security.  We’re thrilled that the easiest managed service for Presto on AWS just got easier and more secure.

Here’s a quick video tutorial that shows you how easy it is to get started with AWS Lake Formation and Ahana:

Additionally, we’ve put together a list of resources where you can learn more about the integration.

What’s Next?

If you’re ready to get started with AWS Lake Formation and Ahana Cloud, head over to our account sign up page where you can start with a free 14-day trial of Ahana Cloud. You can also drop us a note at product@ahana.io and we can help get you started. Happy building!

Ahana Joins Leading Members of the Presto® Community at PrestoCon as Platinum Sponsor, Will Share Use Cases and Technology Innovations

Ahana to deliver keynote and co-present sessions with Uber, Intel and AWS; Ahana customers to present case studies

San Mateo, Calif. – December 1, 2021 Ahana, the only SaaS for Presto, announced today their participation at PrestoCon, a day dedicated to the open source Presto project taking place on Thursday, December 9, 2021. Presto was originally created by Facebook who open sourced and donated the project to Linux Foundation’s Presto Foundation. Since then it has massively grown in popularity with data platform teams of all sizes.

PrestoCon is a day-long event for the PrestoDB community by the PrestoDB community that will showcase more of the innovation within the Presto open source project as well as real-world use cases. In addition to being the platinum sponsor of the event, Ahana will be participating in 5 sessions and Ahana customer Adroitts will also be presenting their Presto use case. Ahana and Intel will also jointly be presenting on the next-generation Presto which includes the native C++ worker.

“PrestoCon is the marquee event for the Presto community, showcasing the latest development and use cases in Presto,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana, Program Chair of PrestoCon and Chair of the Presto Foundation Outreach Committee. “In addition to contributors from Meta, Uber, Bytedance (TikTok) and Twitter sharing their work, we’re excited to highlight more within the Presto ecosystem including contributions like Databricks’ delta lake connector for Presto, Twitter’s Presto Iceberg Connector, and Presto on Spark. Together with our customers like Adroitts, Ahana will be presenting the latest technology innovations including governance on data lakes with Apache Ranger and AWS Lake Formation. We look forward to the best PrestoCon to date.”

“PrestoCon continues to be the showcase event for the Presto community, and we look forward to building on the success of this event over the past year to share even more innovation and use of the open source project with the larger community,” said Chris Aniszczyk, Vice President, Developer Relations, The Linux Foundation. “Presto Foundation continues to focus on community adoption, and PrestoCon is a big part of that in helping bring the Presto community together for a day of deep learning and connecting.”

“As members of the Presto Foundation focused on driving innovation within the Presto open source project, we’re looking forward to sharing our work on the new PrestoDB C++ execution engine with the community at this year’s PrestoCon,” said Arijit Bandyopadhyay, CTO of Enterprise Analytics & AI, Head of Strategy – Cloud and Enterprise, Data Platforms Group, Intel. “Through collaboration with other Presto leaders Ahana, Bytedance, and Meta on this project, we’ve been able to innovate at a much faster pace to bring a better and faster Presto to the community.”

Ahana Customers Speaking at PrestoCon

Ahana Sessions at PrestoCon

  • Authorizing Presto with AWS Lake Formation by Jalpreet Singh Nanda, software engineer, Ahana and Roy Hasson, Principal Product Manager, Amazon 
  • Updates from the New PrestoDB C++ Execution Engine by Deepak Majeti, principal engineer, Ahana and Dave Cohen, senior principal engineer, Intel.
  • Presto Authorization with Apache Ranger by Reetika Agrawal, software engineer, Ahana
  • Top 10 Presto Features for the Cloud by Dipti Borkar, cofounder & CPO, Ahana

Additionally, industry leaders Bytedance (TikTok), Databricks, Meta, Uber, Tencent, and Twitter will be sharing the latest innovation in the Presto project, including Presto Iceberg Connector, Presto on Velox, Presto on Kafka, new Materialized View in Presto, Data Lake Connectors for Presto, Presto on Elastic Capacity, and Presto Authorization with Apache Ranger.

View all the sessions in the full program schedule.  

PrestoCon is a free virtual event and registration is open.

Other Resources

Tweet this: @AhanaIO announces its participation in #PrestoCon #cloud #opensource #analytics #presto https://bit.ly/3l50AwJ

About Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become very popular over the last year or so, especially as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse is, the common architecture of a Data Lakehouse, and how companies are using the Data Lakehouse in production today. Finally, we’ll share a bit on where Ahana Cloud for Presto fits into this architecture and how real companies are leveraging Ahana as the query engine for their Data Lakehouse.

What is a Data Lakehouse?

First, it’s best to explain a Data Warehouse and a Data Lake.

Data Warehouse

A data warehouse is one central place where you can store specific, structured data. Most of the time that’s relational data that comes from transactional systems, business apps, and operational databases. You can run fast analytics on the Data Warehouse with very good price/performance. Using a data warehouse typically means you’re locked into that warehouse’s proprietary formats – the trade-off for the speed and price/performance is that your data is ingested and locked into that warehouse, so you lose the flexibility of a more open solution.

Data Lake

On the other hand, a Data Lake is one central place where you can store any kind of data you want – structured, unstructured, etc. – at scale. Popular data lake storage options include AWS S3, Azure Data Lake Storage, and Google Cloud Storage. Data Lakes are widely popular because they are very cheap and easy to use – you can store a practically unlimited amount of any kind of data at a very low cost. However, the data lake doesn’t provide built-in mechanisms for query, analytics, etc. You need a query engine and data catalog on top of the data lake to query your data and make use of it (that’s where Ahana Cloud comes in, but more on that later).

The Data Lakehouse explained

Data Lakehouse

Now let’s look at the Data Lake vs. the Lakehouse. A new data lakehouse architecture has emerged that takes the best of the Data Warehouse and the Data Lake. That means it’s open, flexible, has good price/performance, and can scale like the Data Lake, while also supporting transactions and strong security like the Data Warehouse.

Data Lakehouse Architecture Explained

Here’s an example of a Data Lakehouse architecture:

An example of a Data Lakehouse architecture

You’ll see the key components include your Cloud Data Lake, your catalog & governance layer, and the data processing (SQL query engine). On top of that you can run your BI, ML, Reporting, and Data Science tools. 

There are a few key characteristics of the Data Lakehouse. First, it’s based on open data formats – think ORC, Parquet, etc. That means you’re not locked into a proprietary format and can use an open source query engine to analyze your data. Your lakehouse data can be easily queried with SQL engines.

Second, a governance/security layer on top of the data lake is important to provide fine-grained access control to data. Last, performance is critical in the Data Lakehouse. To compete with data warehouse workloads, the data lakehouse needs a high-performing SQL query engine on top. That’s where open source Presto comes in, which can provide that extreme performance to give you similar, if not better, price/performance for your queries.
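To make that concrete, here is a sketch of what “open formats plus a SQL engine” looks like in practice: Parquet files sitting in S3 are registered as a table through Presto’s Hive/Glue connector and queried with plain SQL. The catalog name, schema, S3 path, and columns below are placeholders, not a prescribed setup.

```python
# Sketch: register Parquet files in S3 as a table via Presto's Hive/Glue
# connector, then query them with plain SQL. Catalog name, schema, S3 path,
# and columns are placeholders for your own environment.
import prestodb

conn = prestodb.dbapi.connect(
    host="<presto-coordinator>", port=8080, user="analyst",
    catalog="ahana_hive", schema="lakehouse",   # assumed catalog/schema names
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_time timestamp,
        user_id    varchar,
        amount     double
    )
    WITH (
        external_location = 's3a://<your-bucket>/events/',
        format = 'PARQUET'
    )
""")
cur.fetchall()   # consume the result so the DDL statement completes

# Open-format data on S3, queried like any other SQL table.
cur.execute("SELECT user_id, sum(amount) FROM events GROUP BY user_id LIMIT 10")
print(cur.fetchall())
```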

Building your Data Lakehouse with Ahana Cloud for Presto

At the heart of the Data Lakehouse is your high-performance SQL query engine. That’s what enables you to get high performance analytics on your data lake data. Ahana Cloud for Presto is SaaS for Presto on AWS, a really easy way to get up and running with Presto in the cloud (it takes under an hour). This is what your Data Lakehouse architecture would look like if you were using Ahana Cloud:

Building your Data Lakehouse with Ahana Cloud for Presto

Ahana comes built in with a data catalog and caching for your S3-based data lake. With Ahana you get the capabilities of Presto without having to manage the overhead – Ahana takes care of it for you under the hood. The stack also includes and integrates with transaction managers like Apache Hudi and Delta Lake, as well as AWS Lake Formation.

We shared more on how to unlock your data lake with Ahana Cloud in the data lakehouse stack in a free on-demand webinar.

Ready to start building your Data Lakehouse? Try it out with Ahana. We have a 14-day free trial (no credit card required), and in under 1 hour you’ll have SQL running on your S3 data lake.

Presto on Spark

Overview

Presto was originally designed to run interactive queries against data warehouses, but now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. Popular workloads on data lakes include:

1. Reporting and dashboarding

This includes serving custom reporting to both internal and external developers for business insights; many organizations also use Presto for interactive A/B testing analytics. A defining characteristic of this use case is a requirement for low latency – tens to hundreds of milliseconds at very high QPS – and, not surprisingly, this use case relies almost exclusively on Presto; it’s what Presto is designed for.

2. Data science with SQL notebooks

This use case is one of ad hoc analysis and typically needs moderate latency, ranging from seconds to minutes. These are the queries of data scientists and business analysts who want to perform compact ad hoc analysis to understand product usage – for example, user trends and how to improve the product. The QPS is relatively lower because users have to manually initiate these queries.

3. Batch processing for large data pipelines

These are scheduled jobs that run every day, every hour, or whenever the data is ready. They often contain queries over very large volumes of data; the latency can be up to tens of hours, and processing can range from CPU days to CPU years over terabytes to petabytes of data.

Presto works exceptionally well for ad hoc or interactive queries today, and even some batch queries, with the constraint that the entire query must fit in memory and run quickly enough that fault tolerance is not required. Most ETL batch workloads that don’t fit in this box run on “very big data” compute engines like Apache Spark. Having multiple compute engines with different SQL dialects and APIs makes managing and scaling these workloads complicated for data platform teams. Hence, Facebook decided to simplify things and build Presto on Spark as the path to further scale Presto. Before we get into Presto on Spark, let me explain a bit more about the architecture of each of these two popular engines.

Presto’s Architecture


Presto is designed for low latency and follows the classic MPP architecture; it uses in-memory streaming shuffle to achieve low latency. Presto has a single shared coordinator per cluster with an associated pool of workers. Presto tries to schedule as many queries as possible on the same Presto worker (shared executor), in order to support multi-tenancy.

This architecture provides very low latency scheduling of tasks and allows concurrent processing of multiple stages of a query, but the trade-off is that the coordinator is a single point of failure and a bottleneck, and queries are poorly isolated across the entire cluster.

Additionally, streaming shuffle does not allow for much fault tolerance, further impacting the reliability of long-running queries.

Spark’s Architecture


On the other hand, Apache Spark is designed for scalability from the very beginning, and it implements a Map-Reduce architecture. Shuffle is fully materialized to disk between stages of execution, with the capability to preempt or restart any task. Spark maintains an isolated Driver to coordinate each query and runs tasks in isolated containers scheduled on demand. These differences improve reliability and reduce overall operational overhead.

Why isn’t Presto alone a good fit for batch workloads?

Scaling an MPP architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1]. To simplify this, let’s examine the aggregation query below. Essentially, this query goes over the orders table in TPC-H, grouping on custkey and summing the total price. Presto leverages in-memory shuffle and executes the shuffle on custkey, after reading the data and doing aggregation for the same key on each worker.
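For reference, the aggregation being described looks roughly like the following, written here against Presto’s built-in TPC-H connector (this assumes a catalog named tpch is configured on the cluster; substitute your own orders table otherwise).

```python
# Rough sketch of the aggregation discussed above, run against Presto's
# built-in TPC-H connector (assumes a catalog named "tpch" is configured).
import prestodb

conn = prestodb.dbapi.connect(
    host="<presto-coordinator>", port=8080, user="etl",
    catalog="tpch", schema="tiny",
)
cur = conn.cursor()

# Group the orders table on custkey and sum totalprice; at petabyte scale this
# is exactly the shape of query where the in-memory shuffle becomes the limit.
cur.execute("""
    SELECT custkey, sum(totalprice) AS total_spent
    FROM orders
    GROUP BY custkey
""")
print(cur.fetchmany(10))
```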


Doing in-memory shuffle means the producer buffers data in memory and waits for it to be fetched by the consumer. As a result, we have to execute all the tasks before and after the exchange at the same time. Thinking about it in MapReduce terms, all the mappers and reducers have to run concurrently. This makes in-memory shuffle an all-or-nothing execution model.

This causes inflexible scheduling, and scaling query size becomes more difficult because everything runs concurrently. In the aggregation phase, the query may exceed the memory limit because everything has to be held in memory in hash tables in order to track each group (custkey).

Additionally, we are limited by the size of the cluster in how many nodes we can hash-partition the data across to avoid having to fit it all in memory. Using distributed disk (Presto on Spark, Presto Unlimited), we can partition the data further and are only limited by the number of open files – and even that is a limit that can be scaled quite a bit by a shuffle service.

For that reason, it is difficult to scale Presto to very large and complex batch pipelines – pipelines that keep running for hours to join and aggregate over a huge amount of data. This motivated the development of Presto Unlimited, which adapts Presto’s MPP design to large ETL workloads and improves the user experience at scale.


While Presto Unlimited solved part of the problem by allowing shuffle to be partitioned over distributed disk, it didn’t fully solve fault tolerance, and did nothing to improve isolation and resource management.

Presto on Spark

Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler/evaluation as a library with Spark’s RDD API used to manage execution of Presto’s embedded evaluation. This is similar to how Google chose to embed F1 Query inside their MapReduce framework.

The high-level goal is to bring a fully disaggregated shuffle to Presto’s MPP runtime, and we achieved this by adding a materialization step right after the shuffle. The materialized shuffle is modeled as a temporary partition table, which brings more flexible execution after the shuffle and allows partition-level retries. With Presto on Spark, we can do a fully disaggregated shuffle on custkey for the above query on both the mapper and reducer side, which means all mappers and reducers can be independently scheduled and independently retried.


Presto On Spark at Intuit

Superglue is a homegrown tool at Intuit that helps users build, manage and monitor data pipelines. Superglue was built to democratize data for analysts and data scientists. Superglue minimizes time spent developing and debugging data pipelines, and maximizes time spent on building business insights and AI/ML.

Many analysts at Intuit use Presto (AWS Athena) to explore data in the data lake/S3. These analysts would spend several hours converting the exploration SQL written for Presto into Spark SQL to operationalize/schedule it as data pipelines in Superglue. To minimize SQL dialect conversion issues and the associated productivity loss for analysts, the Intuit team started to explore various options, including query translation, query virtualization, and Presto on Spark. After a quick POC, Intuit decided to go with Presto on Spark because it leverages Presto’s compiler/evaluation as a library (no query conversion is required) and Spark’s scalable data processing capabilities.

Presto on Spark is now in production at Intuit. Within three months, hundreds of critical pipelines with thousands of jobs were running on Presto on Spark via Superglue.

Presto on Spark runs as a library that is submitted with spark-submit or as a Jar task on the Spark cluster. Scheduled batch data pipelines are launched on ephemeral clusters to take advantage of resource isolation, manage cost, and minimize operational overhead. DDL statements are executed against Hive and DML statements are executed against Presto. This enables analysts to write Hive-compatible DDL, and the user experience remains unchanged.
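For illustration, a job of this kind is launched roughly like the sketch below, following the Presto on Spark launcher described in the Presto documentation; the version numbers, resource settings, and file paths are placeholders and will differ per environment:

spark-submit \
  --master yarn \
  --executor-cores 4 \
  --conf spark.task.cpus=4 \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-0.272.jar \
  --package presto-spark-package-0.272.tar.gz \
  --config ./config.properties \
  --catalogs ./catalogs \
  --catalog hive \
  --schema default \
  --file query.sql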

This solution helped enable a performant and scalable platform with a seamless end-to-end experience for analysts to explore and process data. It thereby improved analysts’ productivity and empowered them to deliver insights at high speed.

When To Use Spark’s Execution Engine With Presto

Spark is the tool of choice across the industry for running large scale complex batch ETL pipelines. Presto on Spark heavily benefits pipelines written in Presto that operate on terabytes/petabytes of data, as it takes advantage of Spark’s large scale processing capabilities. The biggest win here is that no query conversion is required and you can leverage Spark for

  • Scaling to larger data volumes
  • Scaling Presto’s resource management to larger clusters
  • Increasing the reliability and elasticity of Presto as a compute engine

Why ‘Presto on Spark’ matters

We tried to achieve the following to adapt ‘Presto on Spark’ to Internet-scale batch workloads [2]:

  • Fully disaggregated shuffles
  • Isolated executors
  • Presto resource management, Different Scheduler, Speculative Execution, etc.

A unified option for ad hoc and batch data processing is very important for creating an experience where queries scale instead of fail, without requiring rewrites between different SQL dialects. We believe this is only a first step towards more confluence between the Spark and Presto communities, and a major step towards enabling a unified SQL experience across interactive and batch use cases. Today, internet giants like Facebook have moved over to Presto on Spark, and we have seen many organizations, including Intuit, start running their complex data pipelines in production with Presto on Spark.

“Presto on Spark” is one of the most active development areas in Presto; feel free to check it out and please give it a star! If you have any questions, feel free to ask in the PrestoDB Slack Channel.

References

[1] MapReduce: Simplified Data Processing on Large Clusters 

[2] Presto-on-Spark: A Tale of Two Computation Engines

AWS Data Analytics & Competency

Ahana Achieves AWS Data & Analytics ISV Competency Status

AWS ISV Technology Partner demonstrates AWS technical expertise 

and proven customer success

San Mateo, Calif. – November 10, 2021 Ahana, the Presto company, today announced that it has achieved Amazon Web Services (AWS) Data & Analytics ISV Competency status. This designation recognizes that Ahana has demonstrated technical proficiency and proven success in helping customers evaluate and use the tools, techniques, and technologies of working with data productively, at any scale, to successfully achieve their data and analytics goals on AWS.

Achieving the AWS Data & Analytics ISV Competency differentiates Ahana as an AWS ISV Partner in the AWS Partner Network (APN) that possesses deep domain expertise in data analytics platforms based on the open source Presto SQL distributed query engine, having developed innovative technology and solutions that leverage AWS services.

AWS enables scalable, flexible, and cost-effective solutions from startups to global enterprises. To support the seamless integration and deployment of these solutions, AWS established the AWS Competency Program to help customers identify Consulting and Technology APN Partners with deep industry experience and expertise. 

“Ahana is proud to achieve the AWS Data & Analytics ISV Competency, which adds to our AWS Global Startups and AWS ISV Accelerate Partner status,” said Steven Mih, Co-Founder and CEO at Ahana. “Our team is dedicated to helping companies bring SQL to their AWS S3 data lake for faster time-to-insights by leveraging the agility, breadth of services, and pace of innovation that AWS provides.”

TWEET THIS: @Ahana Cloud for #Presto achieves AWS Data and #Analytics Competency Status #OpenSource #Cloud https://bit.ly/3EZXpy1

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Tutorial: How to define SQL functions with Presto across all connectors

Presto is the open source SQL query engine for data lakes. It supports many native functions, which are usually sufficient for most use cases. However, there may be a corner case where you need to implement your own function. To simplify this, Presto allows users to define expressions as SQL functions. These are dynamic functions separated from the Presto source code, managed by a function namespace manager that you can set up with a MySQL database. In fact, this is one of the most widely used features of Presto at Facebook, with thousands of functions defined.

Function Namespace Manager

A function namespace is a special catalog.schema that stores functions, e.g. mysql.test. Each catalog.schema can be a function namespace. A function namespace manager is a plugin that manages a set of these function catalog schemas. The catalog can be mapped to a connector in Presto (a connector for functions only, with no tables or views) and allows the Presto engine to perform actions such as creating, altering, and deleting functions.

This user-defined function management is separated from the connector API for flexibility, so these SQL functions can be used across all connectors. Further, a query is guaranteed to use the same version of a function throughout its execution, and any modification to a function is versioned.

Implementation

Today, the function namespace manager is implemented on top of MySQL, so users need a running MySQL service to initialize the MySQL-based function namespace manager.

Step 1: Provision a MySQL server and generate a JDBC URL for access.

Suppose the MySQL server can be reached at localhost:3306; an example database URL is:

jdbc:mysql://localhost:3306/presto?user=root&password=password
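If you don’t already have a MySQL server available, one quick way to provision one for testing is with Docker; this is just a sketch, and the container name, image tag, and password below are placeholders:

docker run -d --name mysql-functions -p 3306:3306 -e MYSQL_ROOT_PASSWORD=password mysql:5.7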

Step 2: Create a database in MySQL to store function namespace manager related data

 CREATE DATABASE presto;
 USE presto;

Step 3: Configure at Presto [2]

Create the function namespace manager configuration under etc/function-namespace/mysql.properties:

function-namespace-manager.name=mysql
database-url=jdbc:mysql://localhost:3306/presto?user=root&password=password
function-namespaces-table-name=function_namespaces
functions-table-name=sql_functions

Then restart the Presto service.

Step 4: Create a new function namespace

Once the Presto server is restarted, we will see the tables below under the presto database (which is being used to manage the function namespace) in MySQL:

mysql> show tables;
+---------------------+
| Tables_in_presto    |
+---------------------+
| enum_types          |
| function_namespaces |
| sql_functions       |
+---------------------+
3 rows in set (0.00 sec)

To create a new function namespace "ahana.default", insert into the function_namespaces table:

INSERT INTO function_namespaces (catalog_name, schema_name)
    VALUES('ahana', 'default');

Step 5: Create a function and query from Presto [1]


Here is a simple example of a SQL function for cosecant:

presto> CREATE OR REPLACE FUNCTION ahana.default.cosec(x double)
RETURNS double
COMMENT 'Cosecant trigonometric function'
LANGUAGE SQL
DETERMINISTIC
RETURNS NULL ON NULL INPUT
RETURN 1 / sin(x);

More examples can be found at https://prestodb.io/docs/current/sql/create-function.html#examples [1]

Step 6: Use the newly created function in a SQL query


Users must use the fully qualified function name when referencing the function in a SQL query.

Following is an example of using the cosec SQL function in a query.

presto> select ahana.default.cosec (50) as Cosec_value;
     Cosec_value     
---------------------
 -3.8113408578721053 
(1 row)

Query 20211103_211533_00002_ajuyv, FINISHED, 1 node
Splits: 33 total, 33 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Here is another simple example: creating an EpochTimeToLocalDate function under the ahana.default function namespace to convert Unix time to the local timezone.

presto> CREATE FUNCTION ahana.default.EpochTimeToLocalDate(x bigint)
     -> RETURNS timestamp 
     -> LANGUAGE SQL 
     -> DETERMINISTIC RETURNS NULL ON NULL INPUT 
     -> RETURN from_unixtime (x);
CREATE FUNCTION

presto> select ahana.default.EpochTimeToLocalDate(1629837828) as date;
          date           
-------------------------
 2021-08-24 13:43:48.000 
(1 row)

Note

The function-namespaces-table-name property (the name of the table that stores all the function namespaces managed by this manager) can be used if there is a need to instantiate multiple function namespace managers. Otherwise, functions created in a single function namespace manager can be used across all the different databases and connectors. [2]
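As a hypothetical sketch of that multi-manager case, assuming each manager is defined by its own properties file under etc/function-namespace/ (the file name, e.g. mysql2.properties, and the table names below are placeholders), a second manager could point at its own set of tables:

function-namespace-manager.name=mysql
database-url=jdbc:mysql://localhost:3306/presto?user=root&password=password
function-namespaces-table-name=function_namespaces_2
functions-table-name=sql_functions_2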

At Ahana we have simplified all these steps: the MySQL container, schema, database, tables, and the additional configuration required to manage functions. Data platform users just create their own SQL functions and use them in SQL queries; there is no need to worry about provisioning and managing additional MySQL servers.

Future Roadmap

Remote function support with a remote UDF Thrift API

This allows you to run arbitrary functions that are either not safe or not possible to run within the worker JVM: unreliable Java functions, C++, Python, etc.

References

[1] DDL Syntax to use FUNCTIONS

[2] Function Namespace Manager Documentation

Ahana Cofounder Will Present Session At Next Gen Big Data Platforms Meetup hosted by LinkedIn About Open Data Lake Analytics

San Mateo, Calif. – November 2, 2021 — Ahana, the Presto company, today announced that its Cofounder and Chief Product Officer Dipti Borkar will present a session at Next Gen Big Data Platforms Meetup hosted by LinkedIn about open data lake analytics. The event is being held on Wednesday, November 10, 2021.

Session Title: “Unlock the Value of Data with Open Data Lake Analytics.”

Session Time: Wednesday, November 10 at 4:10 pm PT / 7:10 pm ET

Session Presenter: Ahana Cofounder and Chief Product Officer and Presto Foundation Chairperson, Outreach Team, Dipti Borkar

Session Details: Favored for its affordability, data lake storage is becoming standard practice as data volumes continue to grow. Data platform teams are increasingly looking at data lakes and building advanced analytical stacks around them with open source and open formats to future-proof their platforms. This meetup will help you gain clarity around the choices available for data analytics and the next generation of the analytics stack with open data lakes. The presentation will cover: generations of analytics, choosing data lakes vs. data warehouses, how these approaches differ from the Hadoop generation, why open matters, use cases and workloads for data lakes, and an intro to the data lakehouse stack.

To register for the Next Gen Big Data Platforms Meetup, please go to the event registration page to purchase a registration.

TWEET THIS: @Ahana to present at Next Gen Big Data Platforms Meetup about Open Data Lake Analytics #Presto #OpenSource #Analytics #Cloud https://bit.ly/3vXKl8S

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Presto 105: Running Presto with AWS Glue as catalog on your Laptop

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto 104: Running Presto with Hive Metastore

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with AWS Glue as a catalog on a laptop.

We mentioned in the Presto 104 tutorial why we use a catalog. Just to recap, Presto is a disaggregated database engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, and execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog where the database schema resides is simply called a catalog. Two popular catalogs have emerged – the Hive Metastore and the AWS Glue catalog.

What is AWS Glue?

AWS Glue is an event-driven, serverless computing platform provided by AWS. AWS Glue provides a data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog does the mapping between the database tables and columns and the objects or files that reside in the data lake. These could be files or immutable objects in AWS S3.

In this tutorial, we will focus on using Presto with the AWS Glue catalog on your laptop.

This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.

Implementation steps

Step 1: 

Create a Docker network namespace so that containers can communicate with each other over it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2: 

Ahana has developed a sandbox for prestodb, which can be downloaded from Docker Hub (ahanaio/prestodb-sandbox). Use the command below to pull the prestodb sandbox, which comes with all the packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete                                                          

e8f8aa933633: Pull complete                                                          
b7cf38297b9f: Pull complete                                                          
a4205d42b3be: Pull complete                                                          
81b659bbad2f: Pull complete                                                          
ef606708339: Pull complete                                                          
979857535547: Pull complete                                                          
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the prestodb sandbox and name it coordinator.

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to run on this container, without a worker. So let’s edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator without the worker service.

# docker restart coordinator

Step 6:

Create three more containers using ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1 ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2 ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3 ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081/8082/8083 respectively for each worker, and point discovery.uri to the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install the aws-cli and configure AWS Glue on the coordinator and worker containers.

# yum install -y aws-cli

Step 9: 

Create a glue user and attach the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies to it.

aws iam create-user --user-name glueuser
{
    "User": {
        "Path": "/",
        "UserName": "glueuser",
        "UserId": "AXXXXXXXXXXXXXXXX",
        "Arn": "arn:aws:iam::XXXXXXXXXX:user/glueuser",
        "CreateDate": "2021-10-07T01:07:28+00:00"
    }
}

aws iam list-policies | grep AmazonS3FullAccess
            "PolicyName": "AmazonS3FullAccess",
            "Arn": "arn:aws:iam::aws:policy/AmazonS3FullAccess",

aws iam list-policies | grep AWSGlueConsoleFullAccess
            "PolicyName": "AWSGlueConsoleFullAccess",
            "Arn": "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AmazonS3FullAccess"

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"

Step 10:

Create access key

% aws iam create-access-key --user-name glueuser
{
   "AccessKey": {
       "UserName": "glueuser",
        "AccessKeyId": "XXXXXXXXXXXXXXXXXX", 
       "Status": "Active",
        "SecretAccessKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "CreateDate": "2021-10-13T01:50:45+00:00"
    }
}

Step 11:

Run aws configure and enter the access key and secret key created in the previous step.

aws configure
AWS Access Key ID [None]: XXXXXXXXXXXXX
AWS Secret Access Key [None]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Default region name [None]:
Default output format [None]:

Step 12:

Create the /opt/presto-server/etc/catalog/glue.properties file to add the AWS Glue properties to Presto; this file needs to be added on both the coordinator and worker containers. Add the AWS access and secret keys generated in the previous step to hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key.

connector.name=hive-hadoop2
hive.metastore=glue
hive.non-managed-table-writes-enabled=true
hive.metastore.glue.region=us-east-2
hive.metastore.glue.aws-access-key=<your AWS key>
hive.metastore.glue.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use glue as catalog

bash-4.2# presto-cli --server localhost:8080 --catalog glue

Step 15:

Create a schema using S3 location.

presto:default> create schema glue.demo with (location= 's3://Your_Bucket_Name/demo');
CREATE SCHEMA
presto:default> use demo;

Step 16:

Create table under glue.demo schema

presto:demo> create table glue.demo.part with (format='parquet') AS select * from tpch.tiny.part;
CREATE TABLE: 2000 rows
    
Query 20211013_034514_00009_6hkhg, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:06 [2K rows, 0B] [343 rows/s, 0B/s]

Step 17:

Run select statement on the newly created table.

presto:demo> select * from glue.demo.part limit 10; 
partkey |                   name                   |      mfgr      |  brand
---------+------------------------------------------+----------------+---------
       1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13
       2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13
       3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42
       4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34
       5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32
       6 | bisque cornflower lawn forest magenta    | Manufacturer#2 | Brand#24
       7 | moccasin green thistle khaki floral      | Manufacturer#1 | Brand#11
       8 | misty lace thistle snow royal            | Manufacturer#4 | Brand#44
       9 | thistle dim navajo dark gainsboro        | Manufacturer#4 | Brand#43
      10 | linen pink saddle puff powder            | Manufacturer#5 | Brand#54

Summary

In this tutorial, we provide steps to use Presto with AWS Glue as a catalog on a laptop. If you’re looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.

0 to Presto in 30 minutes with AWS & Ahana Cloud

On-Demand Webinar

Data lakes are widely used and have become extremely affordable, especially with the advent of technologies like AWS S3. During this webinar, Gary Stafford, Solutions Architect at AWS, and Dipti Borkar, Cofounder & CPO at Ahana, will share how to build an open data lake stack with Presto and AWS S3.

Presto, the fast-growing open source SQL query engine, disaggregates storage and compute and leverages all data within an organization for data-driven decision making. It is driving the rise of Amazon S3-based data lakes and on-demand cloud computing. 

In this webinar, you’ll learn:

  • What an Open Data Lake Analytics stack is
  • How you can use Presto to underpin that stack in AWS
  • A demo on how to get started building your Open Data Lake Analytics stack in AWS

Speakers

Gary Stafford

Solutions Architect, AWS

Gary Stafford, AWS

Dipti Borkar

Cofounder & CPO, Ahana

Dipti Borkar, Ahana

Webinar On-Demand
How to Build an Open Data Lake Analytics Stack

While data lakes are widely used and extremely affordable, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake?

The answer is the Open Data Lake Analytics stack. In this webinar, we’ll discuss how to build this stack using 4 key components: open source technologies, open formats, open interfaces & open cloud. Additionally, you’ll learn why open source Presto has become the de facto query engine for the data lake, enabling ad hoc data discovery using SQL.

You’ll learn:

• What an Open Data Lake Analytics Stack is

• How Presto, the de facto query engine for the data lake, underpins that stack

• How to get started building your open data lake analytics stack today

Speaker

Dipti Borkar

Cofounder & CPO, Ahana

Presto 104: Running Presto with Hive Metastore on your Laptop

Introduction

This is the 4th tutorial in our Getting Started with Presto series. To recap, here are the first 3 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with Hive Metastore on a laptop.

Presto is a disaggregated engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, and execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog where the database schema resides lives in what is called a catalog. Two popular catalogs have emerged. From the Hadoop world, the Hive Metastore continues to be widely used. Note this is different from the Hive query engine; the metastore is the system catalog, where information about table schemas and their locations lives. In AWS, the Glue catalog is also very popular.

In this tutorial, we will focus on using Presto with the Hive Metastore on your laptop.   

What is the Hive Metastore?

The Hive Metastore is the mapping between the database tables and columns and the objects or files that reside in the data lake. This could be a file system when using HDFS, or immutable objects in object stores like AWS S3. This document simplifies the process for a laptop scenario to get you started. For real production workloads, Ahana Cloud, which provides Presto as a managed service with a Hive Metastore, is a good choice if you are looking for an easy and performant solution for SQL on AWS S3.

Implementation steps

Step 1

Create a Docker network namespace so that containers can communicate with each other over it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahana has developed a sandbox for prestodb, which can be downloaded from Docker Hub (ahanaio/prestodb-sandbox). Use the command below to pull the prestodb sandbox, which comes with all the packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the prestodb sandbox and name it coordinator.

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to run on this container, without a worker. So let’s edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator without the worker service.

# docker restart coordinator

Step 6:

Create three more containers using ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3  ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081/8082/8083 respectively for each worker, and point discovery.uri to the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install and configure Hive on the coordinator container.

Install wget, procps, tar, and less:

# yum install -y wget procps tar less

Step 9:

Download and install the Hive and Hadoop packages, and set the HOME and PATH variables for Java, Hive, and Hadoop.

#HIVE_BIN=https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
#HADOOP_BIN=https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz


#wget --quiet ${HIVE_BIN}
#wget --quiet ${HADOOP_BIN}


#tar -xf apache-hive-3.1.2-bin.tar.gz -C /opt
#tar -xf hadoop-3.3.1.tar.gz -C /opt
#mv /opt/apache-hive-3.1.2-bin /opt/hive
#mv /opt/hadoop-3.3.1 /opt/hadoop


#export JAVA_HOME=/usr
#export HIVE_HOME=/opt/hive
#export HADOOP_HOME=/opt/hadoop
#export PATH=$PATH:${HADOOP_HOME}:${HADOOP_HOME}/bin:$HIVE_HOME:/bin:.
#cd /opt/hive

Step 10:

Download additional jars needed to run with S3

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.10.6/aws-java-sdk-core-1.10.6.jar

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar

#wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.4/hadoop-aws-2.8.4.jar

#cp aws-java-sdk-core-1.10.6.jar /opt/hadoop/share/hadoop/tools/lib/
#cp aws-java-sdk-s3-1.10.6.jar  /opt/hadoop/share/hadoop/tools/lib/
#cp hadoop-aws-2.8.4.jar  /opt/hadoop/share/hadoop/tools/lib/

echo "export
HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.ja

r:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3
1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar" >>/opt/hive/conf/hive-env.sh

Step 11:

Configure and start hive

cp /opt/hive/conf/hive-default.xml.template /opt/hive/conf/hive-site.xml
mkdir -p /opt/hive/hcatalog/var/log
bin/schematool -dbType derby -initSchema
bin/hcatalog/sbin/hcat_server.sh start

Step 12:

Create the /opt/presto-server/etc/catalog/hive.properties file to add the Hive endpoint to Presto; this file needs to be added on both the coordinator and worker containers.

If you choose to validate using an AWS S3 bucket, provide the security credentials for it.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Key>
hive.s3.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use hive as catalog

bash-4.2# presto-cli --server localhost:8080 --catalog hive

Step 15:

Create schema using local or S3 location.

presto:default> create schema tpch with (location='file:///root');
CREATE SCHEMA
presto:default> use tpch;

If you have access to an S3 bucket, then use the following create command with S3 as the destination:

presto:default> create schema tpch with (location='s3a://bucket_name');
CREATE SCHEMA
presto:default> use tpch;

Step 16:

Hive has the option to create two types of tables:

  • Managed tables 
  • External tables

Managed tables are tightly coupled with the data at the destination, which means that if you delete a table, the associated data is also deleted.

External tables are loosely coupled with the data; the table maintains a pointer to the data, so deleting the table does not delete the data at the external location.

Transactional (ACID) semantics are only supported on managed tables.
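For contrast, an external table could be declared roughly as follows; this is just a sketch, and the column list and S3 path are placeholders:

presto:tpch> create table hive.tpch.orders_ext (
                orderkey bigint,
                totalprice double
             )
             with (format = 'PARQUET', external_location = 's3a://bucket_name/path/to/orders');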

We will create a managed table under the hive.tpch schema:

presto:tpch> create table hive.tpch.lineitem with (format='PARQUET') AS SELECT * FROM tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows
Query 20210921_051649_00015_uvkq7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1:48 [6M rows, 0B] [55.4K rows/s, 0B/s]

Step 17:

Do a desc table to see the table.

presto> desc hive.tpch.lineitem     
-> ;    
Column     |    Type     | Extra | Comment
---------------+-------------+-------+--------- 
orderkey      | bigint      |       | 
partkey       | bigint      |       | 
suppkey       | bigint      |       | 
linenumber    | integer     |       | 
quantity      | double      |       | 
extendedprice | double      |       | 
discount      | double      |       | 
tax           | double      |       | 
returnflag    | varchar(1)  |       | 
linestatus    | varchar(1)  |       | 
shipdate      | date        |       | 
commitdate    | date        |       | 
receiptdate   | date        |       | 
shipinstruct  | varchar(25) |       | 
shipmode      | varchar(10) |       | 
comment       | varchar(44) |       |
(16 rows)
Query 20210922_224518_00002_mfm8x, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:08 [16 rows, 1.04KB] [1 rows/s, 129B/s]

Summary

In this tutorial, we provide steps to use Presto with Hive Metastore as a catalog on a laptop. Additionally AWS Glue can also be used as a catalog for prestodb. If you’re looking to get started easily with Presto and a pre-configured Hive Metastore, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.

Webinar On-Demand
Unlocking the Value of Your Data Lake

Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.

During this webinar, Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.

Dipti will cover:

  • Open Data Lake analytics – what it is and what use cases it supports
  • Why companies are moving to an open data lake analytics approach
  • Why the open source data lake query engine Presto is critical to this approach

Speaker

Dipti Borkar

Cofounder & CPO, Ahana

Connecting to Presto with Superset

Presto with Superset

This blog post will provide you with an understanding of how to connect Superset to Presto.

TL;DR

Superset refers to a connection to a distinct data source as a database. A single Presto cluster can connect to multiple data sources by configuring a Presto catalog for each desired data source. Hence, to make a Superset database connection to a particular data source through Presto, you must specify the Presto cluster and catalog in the SQLAlchemy URI as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>.

Superset and SQLAlchemy

Superset is built as a Python Flask web application and leverages SQLAlchemy, a Python SQL toolkit, to provide a consistent abstraction layer to relational data sources. Superset uses a consistent SQLAlchemy URI as a connection string for a defined Superset database. The schema for the URI is as follows: dialect+driver://username:password@host:port/database. We will deconstruct the dialect, driver, and database in the following sections.

SQLAlchemy defines a dialect as the system it uses to communicate with the specifics of various databases (e.g. the flavor of SQL) and the DB-API, a low-level Python API for talking to a specific relational data source. A Python DB-API database driver is required for a given data source. For example, PyHive is a DB-API driver used to connect to Presto. It is possible for a single dialect to choose between multiple DB-API drivers. For example, the PostgreSQL dialect can support the following DB-API drivers: psycopg2, pg8000, psycopg2cffi, and pygresql. Typically, a single DB-API driver is set as the default for a dialect and used when no explicit DB-API is specified. For PostgreSQL, the default DB-API driver is psycopg2.

The term database can be confusing since it is heavily overloaded. In a typical scenario, a given data source, such as PostgreSQL, has multiple logical groupings of tables which are called “databases”. In a way, these “databases” provide namespaces for tables; identically named tables can exist in two different “databases” without collision. As an example, we can use the PostgreSQL instance available when locally installing Superset with Docker Compose.

In this instance of PostgreSQL, we have four databases: postgres, superset, template0, and template1.

superset@localhost:superset> \l

+-----------+----------+----------+------------+------------+-----------------------+
| Name      | Owner    | Encoding | Collate    | Ctype      | Access privileges     |
|-----------+----------+----------+------------+------------+-----------------------|
| postgres  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| superset  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| template0 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
| template1 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
+-----------+----------+----------+------------+------------+-----------------------+

We can look into the superset database and see the tables in that database.

The key thing to remember here is that ultimately a Superset database needs to resolve to a collection of tables, whatever that is referred to in a particular dialect.

superset@localhost:superset> \c superset

You are now connected to database "superset" as user "superset"

+--------+----------------------------+-------+----------+
| Schema | Name                       | Type  | Owner    |
|--------+----------------------------+-------+----------|
| public | Clean                      | table | superset |
| public | FCC 2018 Survey            | table | superset |
| public | ab_permission              | table | superset |
| public | ab_permission_view         | table | superset |
| public | ab_permission_view_role    | table | superset |
| public | ab_register_user           | table | superset |
| public | ab_role                    | table | superset |
| public | ab_user                    | table | superset |
| public | ab_user_role               | table | superset |
| public | ab_view_menu               | table | superset |
| public | access_request             | table | superset |
| public | alembic_version            | table | superset |
| public | alert_logs                 | table | superset |
| public | alert_owner                | table | superset |
| public | alerts                     | table | superset |
| public | annotation                 | table | superset |
| public | annotation_layer           | table | superset |
| public | bart_lines                 | table | superset |
| public | birth_france_by_region     | table | superset |
| public | birth_names                | table | superset |
| public | cache_keys                 | table | superset |
| public | channel_members            | table | superset |
| public | channels                   | table | superset |
| public | cleaned_sales_data         | table | superset |
| public | clusters                   | table | superset |
| public | columns                    | table | superset |
| public | covid_vaccines             | table | superset |
:

With an understanding of dialects, drivers, and databases under our belt, let’s solidify it with a few examples. Let’s assume we want to create a Superset database connection to a PostgreSQL data source and a particular PostgreSQL database named mydatabase. Our PostgreSQL data source is hosted at pghost on port 5432, and we will log in as sonny (password is foobar). Here are three SQLAlchemy URIs we could use (inspired by the SQLAlchemy documentation):

  1. postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase We explicitly specify the postgresql dialect and psycopg2 driver.
  2. postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase We use the pg8000 driver.
  3. postgresql://sonny:foobar@pghost:5432/mydatabase We do not explicitly list any driver, and hence, SQLAlchemy will use the default driver, which is psycopg2 for postgresql.

Superset lists its recommended Python packages for database drivers in the public documentation.

Presto Catalogs

Because Presto can connect to multiple data sources, when connecting to Presto as a defined Superset database, it’s important to understand what you are actually making a connection to.

In Presto, the equivalent notion of a “database” (i.e. logical collection of tables) is called a schema. Access to a specific schema (“database”) in a data source, is defined in a catalog.

As an example, the listing below is the equivalent catalog configuration to connect to the example mydatabase PostgreSQL database we described previously. If we were querying a table in that catalog directly from Presto, a fully-qualified table would be specified as catalog.schema.table (e.g. select * from catalog.schema.table). Hence, querying the Clean table would be select * from postgresql.mydatabase.Clean.

connector.name=postgresql
connection-url=jdbc:postgresql://pghost:5432/mydatabase
connection-user=sonny
connection-password=foobar

Superset to Presto

Going back to Superset, to create a Superset database to connect to Presto, we specify the Presto dialect. However, because Presto is the intermediary to an underlying data source, such as PostgreSQL, the username and password we need to provide (and authenticate against) are the Presto username and password. Further, we must specify a Presto catalog for the database in the SQLAlchemy URI. From there, Presto, through its catalog configuration, authenticates to the backing data source with the appropriate credentials (e.g. sonny and foobar). Hence, the SQLAlchemy URI to connect to Presto in Superset is as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>

The http-server-port refers to the http-server.http.port configuration on the coordinator and workers (see Presto config properties); it is usually set to 8080.
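For example, assuming a hypothetical Presto user ahana with password secret, a coordinator reachable at presto.example.com:8080, and a catalog named postgresql, the URI would be:

presto://ahana:secret@presto.example.com:8080/postgresql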

New Superset Database Connection UI

In Superset 1.3, there is a feature-flagged version of a new database connection UI that simplifies connecting to data without constructing the SQLAlchemy URI. The new database connection UI can be turned on in config.py with FORCE_DATABASE_CONNECTIONS_SSL = True (PR #14934). The new UI can also be viewed in the Superset documentation.

Try It Out!

In less than 30 minutes, you can get up and running using Superset with a Presto cluster with Ahana Cloud for Presto. Ahana Cloud for Presto is an easy-to-use fully managed Presto service that also automatically stands up a Superset instance for you. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Presto Tutorial 103: PrestoDB cluster on GCP

Introduction

This tutorial is Part III of our Getting started with PrestoDB series. As a reminder, Prestodb is an open source distributed SQL query engine. In tutorial 102 we covered how to run a three node prestodb cluster on a laptop. In this tutorial, we’ll show you how to run a prestodb cluster in a GCP environment using VM instances and GKE containers.

Environment

This guide was developed on GCP VM instances and GKE containers.

Presto on GCP with VMs

Implementation steps for prestodb on vm instances

Step 1: Create a GCP VM instance using the CREATE INSTANCE tab and name it presto-coordinator. Next, create three more VM instances named presto-worker1, presto-worker2, and presto-worker3 respectively.

Step 2: By default, GCP blocks all network ports, so prestodb will need ports 8080-8083 enabled. Use the firewall rules tab to enable them.
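As a sketch, the same ports can also be opened from the CLI with a firewall rule like the one below; the rule name is a placeholder, and the open source range should be restricted appropriately for anything beyond a quick test:

gcloud compute firewall-rules create presto-ports \
    --allow=tcp:8080-8083 \
    --source-ranges=0.0.0.0/0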

Step 3: 

Install Java and Python.
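As an example, on a Debian/Ubuntu-based VM image this could look roughly like the following; package names vary by distribution, and this Presto release line expects Java 8:

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk python3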

Step 4:

Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.235.1, which we will call the installation directory.

Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io

user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  721M  100  721M    0     0   245M      0  0:00:02  0:00:02 --:--:--  245M
user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100 12.7M  100 12.7M    0     0  15.2M      0 --:--:-- --:--:-- --:--:-- 15.1M
user@presto-coordinator-1:~$

Step 5:

Use gunzip and tar to unzip and untar the presto-server

user@presto-coordinator-1:~$gunzip presto-server-0.235.1.tar.gz ;tar -xf presto-server-0.235.1.tar

Step 6: (optional)

Rename the directory without version number

user@presto-coordinator-1:~$ mv presto-server-0.235.1 presto-server

Step 7:  

Create etc, etc/catalog and data directories

user@presto-coordinator-1:~/presto-server$ mkdir etc etc/catalog data

Step 8:

Define the etc/node.properties, etc/config.properties, etc/jvm.config, etc/log.properties and etc/catalog/jmx.properties files as shown below for the Presto coordinator server.

user@presto-coordinator-1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/user/presto-server/data

user@presto-coordinator-1:~/presto-server$ cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

user@presto-coordinator-1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-coordinator-1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-coordinator-1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx

Step 9:

Check the cluster UI status on the coordinator’s port 8080. It should show the active worker count at 0, since we enabled only the coordinator.

Step 10: 

Repeat steps 1 to 8 on the remaining 3 vm instances which will act as worker nodes.

On the configuration step for worker nodes, set coordinator to false and http-server.http.port to 8081, 8082 and 8083 for worker1, worker2 and worker3 respectively.

Also make sure node.id and http-server.http.port are different for each worker node.

user@presto-worker1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffffd
node.data-dir=/home/user/presto-server/data
user@presto-worker1:~/presto-server$ cat etc/config.properties
coordinator=false
http-server.http.port=8083
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://presto-coordinator-1:8080

user@presto-worker1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-worker1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-worker1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx

Step 11: 

Check the cluster status; it should reflect the three worker nodes as part of the prestodb cluster.

Step 12:

Verify the prestodb environment by running the prestodb CLI with a simple JMX query.

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator


Implementation steps for Prestodb on GKE containers

Step 1:

Go to the Google Cloud Console and activate the Cloud Shell window.

Step 2:

Create an artifacts repository using the command below, replacing REGION with the region where you would like to create the repository.

gcloud artifacts repositories create ahana-prestodb \
   --repository-format=docker \
   --location=REGION \
   --description="Docker repository"

Step 3:

Create the container cluster by using the gcloud command: 

user@cloudshell:~ (weighty-list-324021)$ gcloud config set compute/zone us-central1-c
Updated property [compute/zone].

user@cloudshell:~ (weighty-list-324021)$ gcloud container clusters create prestodb-cluster01

Creating cluster prestodb-cluster01 in us-central1-c…done.
Created 
.
.
.

kubeconfig entry generated for prestodb-cluster01.
NAME                LOCATION       MASTER_VERSION   MASTER_IP     MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
prestodb-cluster01  us-central1-c  1.20.8-gke.2100  34.72.76.205  e2-medium     1.20.8-gke.2100  3          RUNNING
user@cloudshell:~ (weighty-list-324021)$

Step 4:

After container cluster creation, run the following command to see the cluster’s three nodes

user@cloudshell:~ (weighty-list-324021)$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gke-prestodb-cluster01-default-pool-34d21367-25cw   Ready    <none>   7m54s   v1.20.8-gke.2100
gke-prestodb-cluster01-default-pool-34d21367-7w90   Ready    <none>   7m54s   v1.20.8-gke.2100
gke-prestodb-cluster01-default-pool-34d21367-mwrn   Ready    <none>   7m53s   v1.20.8-gke.2100
user@cloudshell:~ (weighty-list-324021)$

Step 5:

Pull the prestodb docker image 

user@cloudshell:~ (weighty-list-324021)$ docker pull ahanaio/prestodb-sandbox

Step 6:

Run ahanaio/prestodb-sandbox locally in the shell and create an image named coordinator, which will later be deployed to the container cluster.

user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name coordinator ahanaio/prestodb-sandbox
391aa2201e4602105f319a2be7d34f98ed4a562467e83231913897a14c873fd0

Step 7:

Edit the etc/config.properties file inside the container and set the node-scheduler.include-coordinator property to false. Then restart the coordinator.

user@cloudshell:~ (weighty-list-324021)$ docker exec -i -t coordinator bash                                                                                                                       
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker restart coordinator
coordinator

Step 8:

Now do a docker commit and create a tag called coordinator based on the image ID; this will create a new local image called coordinator.

user@cloudshell:~ (weighty-list-324021)$ docker commit coordinator
Sha256:46ab5129fe8a430f7c6f42e43db5e56ccdf775b48df9228440ba2a0b9a68174c

user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED          SIZE
<none>                     <none>    46ab5129fe8a   15 seconds ago   1.81GB
ahanaio/prestodb-sandbox   latest    76919cf0f33a   34 hours ago     1.81GB

user @cloudshell:~ (weighty-list-324021)$ docker tag 46ab5129fe8a coordinator

user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED              SIZE
coordinator                latest    46ab5129fe8a   About a minute ago   1.81GB
ahanaio/prestodb-sandbox   latest    76919cf0f33a   34 hours ago         1.81GB

Step 9:

Create a tag with the artifacts repository path and push the image to the artifacts location.

user@cloudshell:~ docker tag coordinator:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1

user@cloudshell:~ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1

Step 10:

Deploy the coordinator into the cloud container using the below kubectl commands.

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment coordinator --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1
deployment.apps/coordinator created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment coordinator --name=presto-coordinator --type=LoadBalancer --port 8080 --target-port 8080
service/presto-coordinator exposed

user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)          AGE
kubernetes           ClusterIP      10.7.240.1    <none>          443/TCP          41m
presto-coordinator   LoadBalancer   10.7.248.10   35.239.88.127   8080:30096/TCP   92s

Step 11:

Copy the external IP into a browser and check the status of the Presto coordinator UI on port 8080.

Step 12:

To deploy worker1 into the GKE cluster, again start a local container named worker1 using the docker run command, this time based on the coordinator image.

user@cloudshell:~ docker run -d -p 8080:8080 -it --name worker1 coordinator
1d30cf4094eba477ab40d84ae64729e14de992ac1fa1e5a66e35ae553964b44b
user@cloudshell:~

Step 13:

Edit etc/config.properties inside the worker1 container to set coordinator to false and http-server.http.port to 8081. The discovery.uri should point to the coordinator service running inside the GKE cluster.

user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker1  bash                                                                                                                             
bash-4.2# vi etc/config.properties
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://presto-coordinator01:8080

Step 14:

Stop the local worker1 container, commit it as an image, and tag it as worker1:

user@cloudshell:~ (weighty-list-324021)$ docker stop worker1
worker1
user@cloudshell:~ (weighty-list-324021)$ docker commit worker1
sha256:cf62091eb03702af9bc05860dc2c58644fce49ceb6a929eb6c558cfe3e7d9abf
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED         SIZE
<none>                                                                <none>    cf62091eb037   6 seconds ago   1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag cf62091eb037 worker1:latest
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED         SIZE
worker1                                                               latest    cf62091eb037   2 minutes ago   1.81GB

Step 15:

Push the worker1 image to the Google Artifact Registry location:

user@cloudshell:~ (weighty-list-324021)$ docker tag worker1:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1]
b12c3306c4a9: Pushed
.
.
v1: digest: sha256:fe7db4aa7c9ee04634e079667828577ec4d2681d5ac0febef3ab60984eaff3e0 size: 2201

Step 16:

Deploy and expose worker1 from the Artifact Registry into the GKE cluster using the kubectl commands below.

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker01 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
deployment.apps/presto-worker01 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker01 --name=presto-worker01 --type=LoadBalancer --port 8081 --target-port 8081
service/presto-worker01 exposed

Step 17:

Check the Presto UI to confirm that worker1 deployed successfully.

Step 18:

Repeat steps 12 to 17 to deploy worker2 inside the GKE cluster:

  • Deploy a local instance using Docker and name it worker2.
  • Edit etc/config.properties inside the worker2 container to set coordinator to false, the port to 8082, and discovery.uri to the coordinator service name.
  • Stop the instance, then commit it to create a Docker image named worker2.
  • Push the worker2 image to the Google Artifact Registry location.
  • Use kubectl commands to deploy and expose the worker2 instance inside the GKE cluster.
  • Check the Presto UI to confirm that worker2 is active.
user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker2 worker1
32ace8d22688901c9fa7b406fe94dc409eaf3abfd97229ab3df69ffaac00185d
user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker2 bash
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8082
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker commit worker2
sha256:08c0322959537c74f91a6ccbdf78d0876f66df21872ff7b82217693dc3d4ca1e
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                              TAG       IMAGE ID       CREATED          SIZE
<none>                                                                  <none>    08c032295953   11 seconds ago   1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 08c032295953 worker2:latest

user@cloudshell:~ (weighty-list-324021)$ docker tag worker2:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2]
aae10636ecc3: Pushed
.
.
v1: digest: sha256:103c3fb05004d2ae46e9f6feee87644cb681a23e7cb1cbcf067616fb1c50cf9e size: 2410

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker02 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1
deployment.apps/presto-worker02 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker02 --name=presto-worker02 --type=LoadBalancer --port 8082 --target-port 8082
service/presto-worker02 exposed

user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)          AGE
kubernetes             ClusterIP      10.7.240.1     <none>           443/TCP          3h35m
presto-coordinator01   LoadBalancer   10.7.241.37    130.211.208.47   8080:32413/TCP   49m
presto-worker01        LoadBalancer   10.7.255.27    34.132.29.202    8081:31224/TCP   9m15s
presto-worker02        LoadBalancer   10.7.254.137   35.239.88.127    8082:31020/TCP   39s

Step 19:

Repeat steps 12 to 18 to provision worker3 inside the GKE cluster.

user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker3 worker1
6d78e9db0c72f2a112049a677d426b7fa8640e8c1d3aa408a17321bb9353c545

user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker3 bash                                                                                                                              
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8083
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
exit

user@cloudshell:~ (weighty-list-324021)$ docker commit worker3
sha256:689f39b35b03426efde0d53c16909083a2649c7722db3dabb57ff0c854334c06
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                              TAG       IMAGE ID       CREATED          SIZE
<none>                                                                  <none>    689f39b35b03   25 seconds ago   1.81GB
ahanaio/prestodb-sandbox                                                latest    76919cf0f33a   37 hours ago     1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 689f39b35b03 worker3:latest

user@cloudshell:~ (weighty-list-324021)$ docker tag worker3:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3]
b887f13ace4e: Pushed
.
.
v1: digest: sha256:056a379b00b0d43a0a5877ccf49f690d5f945c0512ca51e61222bd537336491b size: 2410

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker03 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1
deployment.apps/presto-worker03 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker03 --name=presto-worker03 --type=LoadBalancer --port 8083 --target-port 8083
service/presto-worker03 exposed



Step 20:

Verify the PrestoDB environment by running the Presto CLI with a simple JMX query:

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator

Summary

In this tutorial you learned how to provision and run PrestoDB inside Google Cloud VM instances and on GKE containers. You should now be able to validate the functional aspects of PrestoDB.

If you want to run production Presto workloads at scale and performance, check out https://www.ahana.io which provides a managed service for Presto.

How do I use mathematical functions, operators, and aggregate functions in Presto?

Presto offers several classes of mathematical functions that operate on single values, as well as mathematical operators that allow for operations on values across columns. In addition, aggregate functions can operate on a set of values to compute a single result.

The mathematical functions are broken into four subcategories: 1. mathematical, 2. statistical, 3. trigonometric, and 4. floating point. The majority fall into the mathematical category, which we’ll discuss separately. The statistical functions are quite sparse, with two functions that compute the lower and upper bound of the Wilson score interval of a Bernoulli process. The trigonometric functions are what you’d expect (e.g. sin, cos, tan, etc.). The floating point functions handle not-a-number (NaN) and infinity cases.
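
For illustration, here is a minimal sketch that exercises one function from each of these smaller subcategories. The literal arguments (for example, 80 successes out of 100 trials at z = 1.96) are arbitrary values chosen only to show the syntax:

select
  sin(pi() / 2) as sin_half_pi,                          /* trigonometric: sine of pi/2 */
  is_nan(nan()) as is_nan_check,                         /* floating point: returns true */
  is_finite(infinity()) as is_finite_check,              /* floating point: returns false */
  wilson_interval_lower(80, 100, 1.96) as wilson_lower;  /* statistical: lower bound of the Wilson score interval */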

The mathematical subcategory falls into a further layer of classification:

  1. Functions that perform coarser approximation, such as rounding and truncation: abs, ceiling (ceil), floor, round, sign, truncate
  2. Conversions: degrees, radians, from_base, to_base
  3. Exponents, logarithms, roots: exp, ln, log2, log10, power (pow), cbrt, sqrt
  4. Convenient constants, such as pi(), e(), random (rand)
  5. Cumulative distribution functions (and inverses): binomial_cdf, inverse_binomial_cdf, cauchy_cdf, inverse_cauchy_cdf, chi_squared_cdf, inverse_chi_squared_cdf, normal_cdf, inverse_normal_cdf, poisson_cdf, inverse_poisson_cdf, weibull_cdf, inverse_weibull_cdf, beta_cdf, inverse_beta_cdf, width_bucket
  6. Miscellaneous: mod, cosine_similarity

The mathematical operators are the basic arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), and modulus (%).

Let’s apply these mathematical functions in an example. In the following query, we have a floating-point column x to which we apply several mathematical functions representative of the subcategories discussed previously: radians (conversion), natural log, the normal CDF, modulo, a random number, and operators.

select
  x,
  radians(x) as radians_x,               /* convert to radians */
  ln(x) as ln_x,                         /* natural log */
  normal_cdf(0, 30, x) as normal_cdf_x,  /* normal CDF */
  mod(x, 2) as mod_x_2,                  /* modulo 2 */
  random() as r,                         /* random number */
  3 * ((x / 2) + 2) as formula           /* formula using operators */
from
  example;

The following is the output of the above query, with some rounding for ease of viewing.

So far, we have seen that mathematical functions, as they are classified in Presto, operate on single values: given a column of values, each function is applied element-wise to that column. Aggregate functions allow us to look across a set of values.

Like mathematical functions, aggregate functions are also broken into subcategories: 1. general, 2. bitwise, 3. map, 4. approximate, 5. statistical, 6. classification metrics, and 7. differential entropy. We will discuss the general and approximate subcategory separately.

The bitwise aggregate functions are two functions that return the bitwise AND and bitwise OR of all input values in 2’s complement representation. The map aggregate functions provide convenient map creation functions from input values. The statistical aggregate functions are the standard summary statistics you would expect, such as stddev, variance, kurtosis, and skewness. The classification metrics and differential entropy aggregate functions are specialized functions that make it easy to analyze binary classification predictive models and binary differential entropy, respectively.
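
As a quick illustration, the following sketch applies a few of these aggregates to the same floating-point column x in the example table used throughout this article. The cast to bigint is an assumption made here because the bitwise aggregates operate on integer values:

select
  bitwise_and_agg(cast(x as bigint)) as bit_and_x,  /* bitwise AND of all values */
  bitwise_or_agg(cast(x as bigint)) as bit_or_x,    /* bitwise OR of all values */
  stddev(x) as stddev_x,                            /* sample standard deviation */
  variance(x) as variance_x,                        /* sample variance */
  skewness(x) as skewness_x,
  kurtosis(x) as kurtosis_x
from
  example;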

The general subcategory falls into a further layer of classification:

  1. Common summarizations: count, count_if, min, max, min_by, max_by, sum, avg, geometric_mean, checksum
  2. Boolean tests: bool_or, bool_and, every
  3. Data structure consolidation: array_agg, set_agg, set_union
  4. Miscellaneous: reduce_agg, arbitrary

Again, let’s apply these aggregate functions in a series of representative examples. In the following query, we apply a series of basic aggregations to our floating-point column x.

select
	sum(x) as sum_x,
	count(x) as count_x,
	min(x) as min_x,
	max(x) as max_x,
	avg(x) as avg_x,
	checksum(x) as ckh_x
from
	example;

The following is the output of the above query.

In the following query, we showcase a boolean test with the bool_or function. We know that the natural log returns NaN for negative values of x. So, if we apply the is_nan check, we expect is_nan(x) to always be false, but is_nan(ln(x)) to occasionally be true. Therefore, if we apply the bool_or aggregation to our is_nan results, we expect the column derived from x to be false (i.e., no true values at all) and the column derived from ln(x) to be true (i.e., at least one true value). The following query and accompanying result illustrate this.

with nan_test as (
	select
		is_nan(x) as is_nan_x,
		is_nan(ln(x)) as is_nan_ln_x
	from
		example
)
select
	bool_or(is_nan_x) as any_nan_x_true,
	bool_or(is_nan_ln_x) as any_nan_ln_x_true
from
	nan_test;

This final example illustrates data structure consolidation, taking the x and radians(x) columns and creating a single row with a map data structure.

with rad as(select x, radians(x) as rad_x from example)
select map_agg(x, rad_x) from rad;

The approximate aggregate functions provide approximate results for aggregating large data sets, such as distinct values (approx_distinct), percentiles (approx_percentile), and histograms (numeric_histogram). In fact, we have a short answer post on how to use the approx_percentile function. Several of the approximate aggregate functions rely on other functions and data structures: quantile digest, HyperLogLog, and KHyperLogLog.
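
Here is a minimal sketch of these approximate aggregates, again assuming the example table with its floating-point column x; the percentile and bucket arguments are illustrative:

select
  approx_distinct(x) as approx_distinct_x,                             /* approximate number of distinct values */
  approx_percentile(x, 0.5) as approx_median_x,                        /* approximate median */
  approx_percentile(x, array[0.25, 0.5, 0.75]) as approx_quartiles_x,  /* approximate quartiles */
  numeric_histogram(10, x) as approx_histogram_x                       /* approximate 10-bucket histogram */
from
  example;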

A natural extension to aggregate functions are window functions, which perform calculations across the rows of a query result. In fact, all aggregate functions can be used as window functions by adding an OVER clause. One popular application of window functions is time series analysis. In particular, the lag window function is quite useful. We have short answer posts on how to use the lag window function and how to compute differences between dates using the lag window function.
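
To make the connection concrete, the sketch below uses aggregate functions as window functions by adding an OVER clause, again on the example table; ordering by x itself is an assumption made purely for illustration:

select
  x,
  avg(x) over () as avg_x_overall,                                     /* aggregate applied over the whole result */
  sum(x) over (order by x rows unbounded preceding) as running_sum_x,  /* running total in x order */
  lag(x, 1) over (order by x) as previous_x                            /* previous value in x order */
from
  example;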

This short article was a high-level overview, and you are encouraged to review the Presto public documentation for Mathematical Functions and Operators and Aggregate Functions. If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Ahana Cofounder Will Co-lead Session At OSPOCon 2021 About Presto SQL Query Engine

San Mateo, Calif. – September 21, 2021 — Ahana, the SaaS for Presto company, today announced that its Cofounder and Chief Product Officer Dipti Borkar will co-lead a session with Facebook Software Engineer Tim Meehan at OSPOCon 2021 about Presto, the Facebook-born open source high performance, distributed SQL query engine. The event is being held September 27-30 in Seattle, WA and Virtual.

Session Title: “Presto – Today and Beyond – The Open Source SQL Engine for Querying All Data Lakes.”

Session Time: Wednesday, September 29 at 3:55pm – 4:45pm PT

Session Presenters: Ahana Cofounder and Chief Product Officer and Presto Foundation Chairperson, Outreach Team, Dipti Borkar; and Facebook Software Engineer and Presto Foundation Chairperson, Technical Steering Committee, Tim Meehan. 

Session Details: Born at Facebook, Presto is an open source high performance, distributed SQL query engine. With the disaggregation of storage and compute, Presto was created to simplify querying of all data lakes – cloud data lakes like S3 and on premise data lakes like HDFS. Presto’s high performance and flexibility has made it a very popular choice for interactive query workloads on large Hadoop-based clusters as well as AWS S3, Google Cloud Storage and Azure blob store. Today it has grown to support many users and use cases including ad hoc query, data lake house analytics, and federated querying. This session will give an overview on Presto including architecture and how it works, the problems it solves, and most common use cases. Dipti and Tim will also share the latest innovation in the project as well as the future roadmap.

To register for Open Source Summit + Embedded Linux Conference + OSPOCon 2021, please go to the event registration page to purchase a registration.

TWEET THIS: @Ahana to present at #OSPOCon 2021 about Presto #Presto #OpenSource #Analytics #Cloud https://bit.ly/3AnfAMl

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Announcing the workload profile feature in Ahana Cloud

Ahana Cloud for Presto is the first fully integrated, cloud-native managed service that simplifies Presto for cloud and data platform teams. With the managed Presto service, we provide a lot of tuned configurations out of the box for Ahana customers.

We’re excited to announce that the workload profile feature is now available on Ahana Cloud. With this release, users can create a cluster with a validated set of configurations that suits the type of workloads or queries users plan to run.

Today, Presto clusters are configured with default properties that work well for generic workloads. However, workload-specific tuning and resource allocation require a good understanding of Presto’s resource consumption for that workload. Further, to change or add any property, you first update the configuration file and then restart the cluster. As a result, data platform teams can spend days, and in many cases weeks, iterating, evaluating, and experimenting with config tuning to reach the ideal configuration for their workloads. To solve this pain point and deliver predictable performance at scale, Ahana Cloud lets users select a tuned set of properties for their desired workload with one click.

Here is the short demo of creating Presto Cluster with a workload profile:

Concurrent queries are simply the number of queries executing at the same time in a given cluster. To simplify this experience we have classified workloads based on the number of concurrent queries and curated a set of tuned session properties for each profile.

Low concurrency is useful for clusters that run a limited number of queries or a few large, complex queries. It also supports bigger and heavier ETL jobs.

High concurrency is better for running multiple queries at the same time. For example, dashboard and reporting queries or A/B testing analytics, etc.

This setting can be changed after the cluster has been created, and a cluster restart is not required; however, the change will only apply to new queries. The following is a short demo of how you can change these profiles on running clusters.

This feature is the beginning of auto-tune capabilities for workload management. We are continuously innovating Ahana Cloud to deliver a seamless, easy experience for data teams looking to leverage the power of Presto. Please give it a try and log in to the Ahana Cloud console to get started. We also have a free trial that you can sign up for today.

Webinar On-Demand
Presto on AWS: Exploring different Presto services

Presto is a widely adopted distributed SQL engine for data lake analytics. Running Presto in the cloud comes with many benefits – performance, price, and scale are just a few. To run Presto on AWS, there are a few services you can use to do that: EMR Presto, Amazon Athena, and Ahana Cloud.

In this webinar, Asif will discuss these 3 approaches, the pros and cons of each, and how to determine which service is best for your use case. He’ll cover:

  • Quick overview of EMR Presto, Athena, and Ahana
  • Benefits and limitations of each
  • How to pick the best approach based on your needs

If you’re using or evaluating Presto today, register to learn more about running Presto in the cloud.

Speaker

Asif Kazi
Principal Solutions Engineer, Ahana

Ahana Joins AWS ISV Accelerate Program to Expand Access to Its Presto Managed Service for Fast SQL on Amazon S3 Data Lakes

Ahana also selected into the invite-only AWS Global Startup Program

San Mateo, Calif. – September 14, 2021 — Ahana, the Presto company, today announced it has been accepted into the AWS ISV Accelerate Program. Ahana Cloud for Presto was launched in AWS Marketplace in December. As a member of the AWS ISV Accelerate Program, Ahana will be able to drive new business and accelerate sales cycles by co-selling with AWS Account Managers who are the trusted advisors in most cases.

Ahana has also been selected into the AWS Global Startup Program, an invite-only, go-to-market program built to support startups that have raised institutional funding, achieved product-market fit, and are ready to scale.

“Traditional warehouses were not designed to hold all of a company’s data. Couple that with rising compute and storage costs associated with the warehouse, and many companies are looking for an alternative,” said Steven Mih, Cofounder and CEO, Ahana, “Open source Presto is that alternative, making it easy to run SQL queries on Amazon S3 data lakes at a much lower price-point. Using Presto as a managed service for open data lake analytics makes it easy to use SQL on Amazon S3, freeing up data platform teams for mission critical, value-add work. Ahana’s acceptance into the AWS ISV Accelerate Program and AWS Global Startup Program will allow us to be better aligned with AWS Account Managers who work closely with AWS customers, to drive adoption of Ahana Cloud for Presto and help more organizations accelerate their time to insights.”

Securonix uses Ahana Cloud for Presto for SQL analytics on their Amazon S3 data lake. They are pulling in billions of events per day, and that data needs to be searched for threats. With Ahana Cloud on AWS, Securonix customers can identify threats in real-time at a reasonable price.

“Before Presto we were using a Hadoop cluster, and the challenge was on scale…not only was it expensive but the scaling factors were not linear,” said Derrick Harcey, Chief Architect at Securonix. “The Presto engine was designed for scale, and it’s feature-built just for a query engine. Ahana Cloud on AWS made it easy for us to use Presto in the cloud.”

Ahana’s acceptance into the AWS ISV Accelerate Program enables the company to meet customer needs through collaboration with the AWS Sales organization. Collaboration with the AWS Sales team enables Ahana to provide better outcomes to customers.

Supporting Resources

Learn more about the AWS ISV Accelerate Program

TWEET THIS: @Ahana joins AWS ISV Accelerate Program, AWS Global Startup Program  https://bit.ly/3A9JdR0 #Presto #OpenSource #Analytics #Cloud

About Ahana

Ahana, the Presto company, offers managed service for Presto with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter, and thousands more, is a standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana 101: An introduction to Ahana Cloud for Presto on AWS, SaaS for Presto on AWS

Webinar On-Demand

Presto is the fastest growing query engine used by companies like Facebook, Uber, Twitter and many more. While powerful, Presto can be complicated to run on your own especially if you’re a smaller team that may not have the skillset.

That’s where Ahana comes in. Ahana Cloud is SaaS for Presto, giving teams of all sizes the power to deploy and manage Presto on AWS. Ahana takes care of hundreds of deployment and management configurations of Presto including attaching/detaching external data sources, configuration parameters, tuning, and much more.

In this webinar Ram will discuss why companies are using Ahana Cloud for their Presto deployments and give an overview of Ahana including:

  • The Ahana SaaS console
  • How easy it is to add data sources like AWS S3 and integrate catalogs like Hive
  • Features like Data Lake Caching for 5x performance and autoscaling

Speaker

Ram Upendran
Technical Product Marketing Manager, Ahana

Presto 101: An introduction to open source Presto

Webinar On-Demand

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Dipti will introduce the Presto technology and share why it’s becoming so popular. In fact, companies like Facebook, Uber, Twitter, Alibaba, and many others use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. We’ll also show a quick demo on getting Presto running in AWS.

Speaker

Dipti Borkar
Cofounder and Chief Product Officer, Ahana

What is a Presto lag example?

The Presto lag function is a window function that returns the value at an offset before the current row within a window. One common use case for the lag function is time series analysis, such as autocorrelation.

Figure 1 shows the advert table of sales and advertising expenditure from Makridakis, Wheelwright and Hyndman (1998) Forecasting: methods and applications, John Wiley & Sons: New York. The advert column is the monthly advertising expenditure, and the sales column is the monthly sales volume.


A simple analysis could be to track the difference between the current month’s sales volume and the previous one, which is shown in Figure 2. The lag_1_sales column is a single period lagged value of the sales column, and the diff column is the difference between sales and lag_1_sales. To generate the table in Figure 2, we can use the lag function and the following query:

select
  advert,
  sales,
  lag_1_sales,
  round(sales - lag_1_sales,2) as diff
from (
  select
    advert,
    sales,
    lag(sales, 1) over(range unbounded preceding) as lag_1_sales
  from advert
);


The subquery uses the lag function to get the value of the sales column one period earlier, where the OVER clause specifies the window. The main query then computes the diff column. Here are a couple of additional useful notes about the lag function:

  1. You can change the offset with the second argument lag(x, OFFSET), where OFFSET is any scalar expression. The default offset is 1 (the previous row); an offset of 0 refers to the current row.
  2. By default, if an offset value is null or falls outside the specified window, a NULL value is returned. We can see this in the first row of the table in Figure 2. However, the value to use in these cases is configurable with an optional third argument lag(x, OFFSET, DEFAULT_VALUE), where DEFAULT_VALUE is the desired value (see the sketch after this list).
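
The following sketch mirrors the earlier query and shows both variations: a two-period offset, and an explicit default of 0.0 in place of NULL for rows that fall outside the window. The particular offset and default values are only illustrative, and we assume sales is a floating-point column:

select
  advert,
  sales,
  lag(sales, 2) over (range unbounded preceding) as lag_2_sales,              /* value two periods back */
  lag(sales, 1, 0.0) over (range unbounded preceding) as lag_1_sales_default  /* previous value, 0.0 instead of NULL */
from advert;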

A closely related function is the lead function, which returns the value at an offset after the current row.

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

SQL on the Data Lake, Using open source Presto to unlock the value of your data lake

Webinar On-Demand

While data lakes are widely used and have become extremely affordable as data volumes have grown, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine, or more commonly the SQL engine, that runs on top of a data lake.

In this webinar, Dipti will discuss why open source Presto has quickly become the de-facto query engine for the data lake. Presto enables ad hoc data discovery where you can use SQL to run queries whenever you want, wherever your data resides. With Presto, you can unlock the value of your data lake.

She will cover:

  • An overview of Presto and why it emerged as the best engine for the data lake
  • How to use Presto to run ad hoc queries on your data lake
  • How you can get started with Presto on AWS S3 today

Speaker

Dipti Borkar
Cofounder and Chief Product Officer, Ahana

How do I get the date_diff from previous rows?

To find the difference in time between consecutive dates in a result set, Presto offers window functions. Take the example table below which contains sample data of users who watched movies.

Example:

select * from movies.ratings_csv limit 10;

select userid, date_diff('day', timestamp, lag(timestamp) over (partition by userid order by  timestamp desc)) as timediff from ratings_csv order by userid desc limit 10;

The lag(x, offset) function fetches the value of column x at the given row offset, and date_diff then calculates the difference between the current and lagged timestamps. When no offset is provided, the default is 1 (the previous row). Notice that the first row in timediff is NULL because there is no previous row.

Webinar On-Demand
Data Warehouse or Data Lake, which one do I use?

Slides

Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). 

There are pros and cons for each approach. While data warehouses give you strong data management and analytics, they don’t handle semi-structured and unstructured data well, they tightly couple storage and compute, and they often come with expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.

In this webinar, you’ll hear from industry analyst John Santaferraro and Ahana cofounder and CPO Dipti Borkar who will discuss the data landscape and how many companies are thinking about their data warehouse/data lake strategy. They’ll share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lake.


Webinar Transcript

SPEAKERS

John Santaferraro | Industry Analyst, Dipti Borkar | CPO & Co-Founder Ahana, Ali LeClerc | Moderator, Ahana

Ali LeClerc | Ahana 

Hi, everybody, welcome to today’s webinar, Data Warehouse or Data Lake, or which one do I use? My name is Ali, and I will be moderating the event today. Before we get started, and I introduce our wonderful speakers, just a few housekeeping items, one, this session is being recorded. If you miss any parts of it, if you join late, you’ll get a link to both the recording and the slides that we are going through today. Second, we have allotted some time for questions at the end. So please feel free to pop in your questions, there is a questions tab in your GoToWebinar panel. You can go ahead, ask away during the session itself, and we’ll get to them at the end.

So, without further ado, I am pleased to introduce our two speakers, we have John Santaferraro and Dipti Borkar. John is an industry analyst who has been doing this for over 26 years and has a ton of experience in the space. So, looking forward to hearing his perspective. And then we’re also joined by Dipti. Dipti Borkar is the co-founder and CPO of Ahana, has a ton of experience in relational and non-relational database engines as well as the analytics market.

Today they’re going to be talking about data warehouse or data lake. So, with that, John, I will throw things over to you, please take it away.

John Santaferraro | Industry Analyst 

Awesome. Really excited to be here. This topic is top of mind, I think for everyone, we’re going to take a little look at history, you know, where did data lakes and data warehouses start? How have they been modernized? What’s going on in this world where they seem to be merging together? And then give you some guidance on what are use cases for these modern platforms? How do you know, how do you choose a data lake or a data warehouse? Which one do you choose? So, we’re going to provide some criteria for that, we’re going to look at the Uber technical case study and answer any questions that you guys have.

So, I’m going to jump right in I actually got into the data warehousing world all the way back in 1995. I co-founded a data warehouse startup company and eventually that sold to Teradata. Right now, I’m thinking back to those times. And really, the whole decade after that the traditional data warehouse was it was a relational database. Typically, with a columnar structure, although some of the original data warehouses didn’t have that, they had in database analytics for performance focused really only on structured data. The data had to be modeled. And data modeling, for a lot of folks was an endless task. And there was the whole ETL process was 70% of every project, extracting from all your source systems, transforming it, loading it into the data warehouse. There was primarily SQL access, and these data warehouses tended to be a few sources, one or two outputs, but they were expensive, slow, difficult to manage. They provided access to limited data. So, there were a lot of challenges, a lot of benefit as well, but a lot of challenges with the traditional data warehouses. So, the data lakes came along and initially Hadoop, you’ll remember this, was going to replace the data warehouse, right?

I remember reading articles about how the data warehouse is dead. This was the introduction of Hadoop with its file system data storage, suddenly, it was inexpensive to load data into the data lake. So, all data went in there, including semi-structured data, unstructured data, it was all about ingestion of data motored in [inaudible] the structure, once it had been loaded. Don’t throw anything out. Primary use cases were for discovery, text analytics, data science. Although there was some SQL access, initially, notebooks and Python, and other languages became the primary way to access. These data lakes were less expensive, but there was limited performance on certain kinds of complex analytics. Most of the analytics folks focused on unstructured data. There was limited SQL access, and they tended to be difficult to govern. Hadoop initially didn’t have all of the enterprise capabilities.

You know, Dipti your around through a lot of that what are your some of your memories about data lakes when they first showed up on the scene?

Dipti Borkar | Ahana 

Yeah. It’s great to do this with you, John, we’ve been at this for a while. I started my career in traditional databases as well, DB2 distributed, core storage and indexing kernel engineering. And we saw this entire movement of Hadoop. What it helped with is, in some ways, the separation of storage and compute. For the first time, it really separated the two layers where storage was HDFS, and then compute went through many generations. Even just in the Hadoop timeframe was MapReduce, Hive, variations of hive and so on. But what happened is the, you know, I feel like the company is the leaders that were driving Hadoop never really simplified it for the platform teams.

Technology is great to build, but if it’s complicated, and if it takes a long time to get value from, no matter how exciting it is, it doesn’t serve its purpose. And that was the biggest challenge with Hadoop, there were 70 different projects, that took six to nine months to integrate into, and to see real value or insights from the data in HDFS, and many of these projects didn’t go well. And that’s why over time, people struggled with it, we were, we’ll talk a little bit about cloud and how the cloud migration is playing such a big role in the in the modernization of these data lakes. So just some perspectives there.

John Santaferraro | Industry Analyst 

Yeah, you know, you just you reminded me as well, the other positive thing is that I think that we had seen open source as an operating system. And with the introduction of Hadoop, there was a massive uptake of acceptance and adoption around open source technology as well. So that was another real positive during that time. So, what we’ve what we’ve seen since the inception of the data warehouse, and you know, the incursion of Hadoop into the marketplace and the data lake, we’ve seen a very rapid modernization of those platforms driven by three things.

One is digital transformation, everything has moved to digital now, especially, massive uptake of mobile technology, internet technology is way more interactive and engaging than when it used to be informational, and tons of more data and data types. Along with that, there is an increasing need for engagement with customers and real time engagement with employees, engagement with partners, everything is moving closer and closer to either just in time or real time. And so that’s created the need to be able to respond quickly to business events of any kind. And I think third, we’re really on the cusp of seeing everything automated.

Obviously, we in the world of robotics, there are there are massive manufacturing plants, where everything is now automated. And that’s being transferred over to this world of robotic process automation. In order to automate everything that requires incredible intelligence delivered to machines, and sensors, and all kinds of, you know, every kind of device that you can imagine on the internet of things in order to automate everything. And so, these, these trends have really pushed us to the modernization of both the data warehouse and the data lake.

And interestingly enough, you can look at the slide that I popped up, but modernization is happening in all of these different areas, in both the data warehouse and the data lake. The most modern of both are cloud first. There is a there was a move to in-memory capabilities. On the Data Warehouse side, they’re now bringing in more complex data types that were typically only handled on the data lake and the modern data lake is bringing in columnar data types and with great performance. Now both have the separation of compute and storage. So, you can read the rest of them here. The interesting thing about the modernization movement is that that both the data warehouse and the data lake are being modernized.

What trends are you seeing and modernization, Dipti? I kind of tend to approach this at a high level looking at capabilities. I know you see the technology underneath and go deep in that direction. What’s your take on this?

Dipti Borkar | Ahana 

Yeah, absolutely. I mean, cloud first is really important. There’s a lot of companies that are increasingly just in the cloud, many are born in the cloud, like Ahana, but also their entire infrastructure is in the cloud. It could be multiple clouds, it could be a single cloud. That’s one of the you know, one of the aspects. The other aspect is within on clouds. containerization. A very, very big trend. A few years ago, Kubernetes wasn’t as stable. And so now today, the way we’ve built Ahana as cloud first, and it runs completely on Kubernetes. And it’s completely containerized. To help with the flexibility of the cloud and the availability, the scalability and leveraging some of those aspects. I think the other aspect is open formats.

Open formats are starting to play a big role. With the data lake, and I call it open data lakes, for a variety of reasons. Open formats is a big part of it. Formats like Apache ORC, Apache Parquet, they can be consumed by many different engines. In one way, you’re not locked into a specific technology, you can actually move from one engine to another, because many of them support it. Spark supports it, Presto supports it, TensorFlow just added some support as well. With a data lake, you can have open formats, which are highly performant, and have multiple types of processing on top of it. So, these are some of the trends that I’m seeing, broadly, on the data lake side.

And, of course, the data warehouses are trying to expand and extend themselves to the data lake as well. But what happens is, when you have a core path, a critical path for any product, it’s built for a specific format, or a specific type of data. We’ve seen that with data warehouses, most of the time it’s proprietary formats. And S3, and these cloud formats, might be an extension. And for data lakes, the data lake engines, are actually built for the open formats, and not for some of these proprietary formats.

These are some of the decisions and considerations that users need to think about in terms of what’s important for them. What kind of information they want to store – overtime, historical data – in their in their data lake or data warehouse? And how open do they want it to be?

John Santaferraro | Industry Analyst 

Yeah, I think you bring up a good point too, in that the modernization of the data lake has really opened up the opportunity for storage options. And specifically, lower cost storage options and storage tiering. So that in that environment, customers can choose where they want to store their data. If they need high performance analytics, then it goes in optimized storage of some kind. If what they need is massive amount of data, then they can store still in file systems. But the object storage, simple storage options are much less costly, and I think we’re I think we’re actually moving towards a world where, at some point in the future, companies will be able to store their data inexpensively, in one place, in one format, and use it endless number of times.

I think that’s the direction that things are going as we look at modernization.

Dipti Borkar | Ahana 

Absolutely. Look at S3, as the cloud, the most widely used, cloud store, it’s 15 years in the making. So, trillions of objects that are in S3. And now that it’s ubiquitous, and it’s so cheap, most of the data is landing there. So that’s the first place the data lands. And users are thinking about okay, once it lands there, what can I do with it? Do I have to move it into another system for analysis? And that might be the case as you said, there will be extremely low latency requirements, in some cases, where it might need to be in a warehouse.

Or it might need to be – of course, you know, operational systems will always be there – here we’re talking about analytics. And what other processing can I run directly on top of S3 and on top of these objects? Without moving the data around. So that I get the cost benefits, which, which AWS has driven through, it’s very cheap to store data now, and so can I have compute on top? Essentially to do structured analysis to do semi-structured analysis, or even unstructured analysis with machine learning, deep learning and so on.

S3 and the cloud migration, I would say, has played a massive role in, in this in the adoption of data lakes, and the move towards the modern data lake that you have here.

John Santaferraro | Industry Analyst 

So, you at Ahana, you guys talk about this move from data to insight and the idea of the SQL query engine. Do you want to walk us through this Dipti?

Dipti Borkar | Ahana 

Yeah, absolutely. I touched on some of the different types of processing that’s possible on top of data lakes. One of those workloads is the SQL workload. Data warehouses and data lakes are sitting next to each other, they sit next to each other. You have, in the data warehouse, obviously you have your storage and your compute. And typically, these are in the tens of terabytes. That’s the dimension of scale most data warehouses tend to be along. But as the amount of data has increased, and the types of information have increased, some of them are contributing to a lot more data: IoT data, device data, third party data, behavioral data. So it used to be just enterprise data, it used to be orders, line items, or when you look at a benchmark like TPC-DS, it’s very, very simple. It’s enterprise data. But now we have a lot more data. And that is leading to all of this information going into the data lake for storage. And the terabytes are now becoming petabytes, even for small companies. So that’s where the cost factor becomes very, very important.

Lower costs are what users are looking for, with infrastructure workloads on top of that. Presto has come up as one of the really great engines for SQL processing on top of data lakes. It can also query other data sources, like MySQL RDS, and so on. But Presto came out of Facebook as a replacement for Hive, which was essentially built for the data lake. And so reporting and dashboarding use cases are great use cases on top of Presto, interactive use cases, I would say. There’s also an ad hoc querying use case that’s increasing. Most often, we’re seeing this with SQL notebooks, with Jupyter or Zeppelin and others, and then there are also data transformation workloads that run on top of the data lakes. Presto is good for that. But there are other engines, like Spark, for example, that actually do a great job.

They’re built for the ETL, or in database in data lake and a transformation and they play a big role in in these workloads that run on top of the data lake. So, what we’re seeing as we talk to users, Data Platform teams, Data Platform engineers, is there are a couple of paths. If they are pre-warehouse, and I call them kind of pre-warehouse users, where they’re not on a data warehouse yet. They’re still perhaps running a Tableau or Looker, or MySQL or Postgres, you now have a choice, for the first time, where you can actually run some of these workloads on a data lake for slightly lower costs because data warehouses could be cost prohibitive. Or one approach was augment the data warehouse. And so you start off with a data warehouse, you might have some critical data in there where you need very low latencies. It is a tightly coupled system. And so you’re going to get good performance. If you don’t need extreme performance, if you don’t have that as a criteria, the data lake option becomes a very real option today, because the complexity of Hadoop has sort of disappeared.

And there are now much simpler solutions that exist from a transformation perspective, like managed services for Spark, as well as, from an interactive querying and ad hoc [inaudible] perspective, managed services for Presto. That’s what we’re seeing in some cases. Users may skip the warehouse, in some cases, they may augment it and have some data in a warehouse and some data in a data lake. Thoughts on that, John?

John Santaferraro | Industry Analyst 

Yeah, I mean, just to just to confirm, I think you’re the diagram that you are presenting here shows Presto, kind of above the Cloud Data Lake. But there could be another version of this. If somebody has a data warehouse, and they don’t want to rip and replace, and go to an open source data warehouse, Presto sits above both the data lake and the traditional the data warehouse. So it can unify access for those same tools above it. SQL access for both the data lake, the open source data warehouse and the columnar data warehouse, isn’t that correct?

Dipti Borkar | Ahana 

Absolutely. And I think we’re at a point where we just have to accept that there is proliferation of databases and data sources, and there will be many. There is a time element where, you know, not all the data may be in the data lake. And so for those use cases, federation, or querying across data sources, is how you can correlate data across different data sources. So if the data is in, not just a data warehouse, but let’s say Elasticsearch, where you’re doing some analytics, it has not landed in the data lake yet. Or a data warehouse. It has not landed in the data lake yet, and the pipelines are still running, you can have a query that runs across both, and have unified access to your data lake as well as your database, your data warehouse, or semi-structured system of record as well.

John Santaferraro | Industry Analyst 

Awesome. Yes. So one of the things that I want to do for you as an audience, is help you understand what kinds of use cases are best for which platform. The modern data lake and the modern data warehouse. Because it is not one size fits all. And so what you see here is actually very similar things on both sides, very similar use cases. But I tried to rank them in order. And at this point, this is not researched. I do research as well. But this happens to be based on my expertise in this area. And, Dipti, feel free to jump in and agree or disagree. Guess what, hey, we’re on a webinar, but we can disagree, right? So on the data lake side, again, one through eight, high performance data intensive kinds of workloads.

We’re going to talk about a story where there is hundreds of petabytes of information and looking at exabyte scale in the next, probably in the next year or so. That is definitely going to happen on the modern data lake not on the modern data warehouse. The data warehouse on the other side, the modern data warehouse, super high compute intensive kinds of workloads with complex analytics, and many joins, may still work better on the modern data warehouse, both have lower cost storage. Again, back to the data lake side, massive scale, well suitable for many different kinds of data types – structured and unstructured – diversity of the kinds of analytics that you want to run.

And then as you get down toward the bottom, you know, things like high concurrency of analytics, you can see it’s up higher on the right-hand side, where you, again, with a columnar database, or you may be able to support higher levels of concurrency. Now, all of this is moving, by the way, because the modern data lakes are working on “how do I drive up concurrency?” They know they’ve got to do that. I would say that because databases have been around a little bit longer, some of the modern data warehouses have more built-in enterprise capabilities. Things like governance and, and other capabilities. But guess what? All of that is rising on the modern data lake side.

So, from my perspective, this is this is my best guess based on 26 years of experience in this industry. All of this is a moving target because things are constantly changing. Dipti, jump in and you certainly don’t have to agree with me. Think of this as a straw man. What’s your take on use cases for the, for these two worlds? Modern data lakes and modern data warehouses?

Dipti Borkar | Ahana 

Yeah, absolutely. I’m trying to figure out where I disagree, John. But in terms of the criteria, these are some of the criteria that our users come with and say, “Look, we are, we are looking at a modern platform for analytics. We have certain criteria, we want to future proof it.” Future proofing is becoming important, because these, these are important decisions that you make for your data platform. You don’t change your data platform every other day.

A lot of these decisions are thought through very carefully, the criteria are weighed in. And there are different tools for different sets of criteria. In terms of data lakes, I would say that the cost aspects and the scale aspects are probably the driving factor for the adoption of data lakes. High performance, I think, tends to be more data intensive, you’re right there. You can also run, obviously, a lot of high complexity queries as well on data lakes. With Presto, as an example of a query engine, you can still run fairly complicated queries.

However, to your point, John, there is a lot of state of the art in the database world, 50 years of research on complex joins, optimizers, optimizations, and in general, that we are actually working on to make the data lake stronger, to get it at par with the data warehouse. Depending on the kind of queries that are run, what we’re seeing is that simple queries, you know, with simple predicates, predicate pushdown, etc., run really great on the lake. There might be areas where the optimizer may not be as capable of figuring out, say, the right way to reorder joins, for example, where there’s work that’s going on. So I think that most of these are in line with what we’re seeing from a user perspective. The other thing that I would add is the open aspect of it. Most of the data lake technologies have emerged from internet companies. And the best part is that they open sourced it. So that has benefited all the users that now have the ability to run Presto or Spark or other things.

But from a warehouse perspective, it’s still very closed, there, there isn’t actually a good open source data warehouse. And as platform teams get more mature and get more skilled, they are looking at ways to interact and contribute back and say, “hey, you know, this feature doesn’t exist yet. Do I wait for a vendor to build it in three years from now? Or can I don’t have the ability to contribute back.” And that’s where the open source aspect that you brought up earlier, starts to play a bigger role, which is not on this list, but it’s also starting to be a big part of decision making, as users and platform teams look at data lakes. They want the ability to contribute back, or at least not get perhaps locked into some extent, and have multiple vendors or multiple people, organizations working on it together so that the technology improves. They have options and they can keep their options open.

John Santaferraro | Industry Analyst 

Yeah, yeah, great, great input. You know, the other trend that I’m seeing, Dipti, is the merging of the cloud data warehouse and the cloud data lake, those two worlds coming together. And I think that’s driven largely by customer demands. I think that there are still a lot of companies that are running a data warehouse, and they have a data lake. But as we’ve talked about the modernization of both of those, and even the similarities now between them that weren’t there 10 years ago, there is a merging of the cloud data warehouse and the cloud data lake.

Customers don’t want to have to manage two different platforms, with two different sets of resources and two different skill sets. It’s too much, and so they want to move from two platforms to one, from two resource types to one, from self-managed to fully managed, from complex queries and joins trying to pull together intelligence that requires both the data lake and the data warehouse, to a simple way to ask questions of both at the same time. And as a result of that, from disparate to connected intelligence, where I don’t have a separate set of intelligence that I get out of my data warehouse and a separate set that comes out of the data lake; I have all of my data, and I can amplify my insight by being able to run queries across both of those, or in a single platform that is able to do the work of what used to be done on the two platforms.

I’m seeing this happen from three different directions. One of them is that traditional data warehouse companies are trying to bring in more complex data types and provide support for discovery kinds of workloads and data science. On the data lake side, great progress has been made with what you [inaudible] the Open Data Warehouse, where you can now analyze ORC and Parquet files, columnar files, in the same way that you would analyze things on a columnar database. So those are two. And then the third, which, go Ahana, is this idea of: why not? Why not take SQL, the lingua franca of all analytics, the most common language of analytics on the planet today, where there are the most resources possible, and be able to run distributed queries across both data lakes and data warehouses and bring the two worlds together? I think this is the direction that things are going, and Dipti, this is where kudos go to Ahana for really commercializing, providing support for, and bringing into the cloud all of the capabilities of Presto.

This is not the Ahana version of why I think this is a good idea; this is the John version. SQL access means you leverage a vast pool of resources, because every company in the world, on both the technical and the business side, has people who understand and write SQL. Better insight, because you’re now looking at data in the data lake and in the data warehouse. Unified analytics, which means you can support more of your business use cases with a distributed query engine. Distributed query engines mean that you get to leverage your existing investment in platforms, with limitless scale and for all data types. So this is my version of the capabilities.

Any thoughts you have on this, Dipti?

Dipti Borkar | Ahana 

Yeah, absolutely. I think that these two spaces are converging, right? There’s the big convergence that’s happening. The way I see it from an architecture and technology perspective is: which one do you want to bet on for the future? Where is the bulk of your data? What is your primary path that you want to optimize for? The reason that’s important is that it will tell you where most of your data lives. Is it 80% in the warehouse? Is it 80% in the lake? And that’s an important decision. This is obviously driven by the business requirements that you have. What we’re seeing is that for some reports or dashboards where you need very, very high-performance access, the data warehouse would be a good fit.

But there is an emerging trend of different kinds of analysis, some of which we don’t even know yet, that’s emerging. And having that data in a lake, and consolidating in a lake, gives you the ability to run these future-proof engines, platforms, whatever tools come out on the lake. Because of the cost profile of S3, GCS, and others, a lot more innovation is happening on the lake side. That becomes the fundamental decision.

The next part, and the good part, is that even if you choose one way or the other (and I will admit I have a bias towards the lake; I’ve spent many years of my life on the warehouse, but for the next 10 years of analytics, I see the future on the data lake), either one you pick, you have a layer on top that can abstract that and give you access across both. And so you now have the ability, which didn’t exist before, to actually query across multiple data sources.

Typically, we’re seeing that it’s the data lake; most people have a data lake, and then they want to query maybe one or two other sources. That’s the use case that we’re seeing. In addition, the cloud, you know, you talked about cloud and full service, is becoming a big criterion for users, because installing, tuning, then ingesting data, running performance benchmarks, tuning some more, that phase of three to six to nine months of running POCs is not helping anyone.

Frankly, it doesn’t help the vendors either, because we want to create value for customers as soon as possible, right? And so with these managed services, what we’ve done with Ahana is we’ve taken a three- to six-month process of installing and tuning down to a 30 minute process where you can actually run SQL on S3 and get started in 30 minutes. This is in your environment, on your S3, using your catalog; it might be AWS Glue, or it might be a Hive metastore. And that is progress from where we were. And so the data platform team can create value for their analysts, data scientists, and data engineers a lot sooner than with some of these other installed products.

So I see it as a few different dimensions: figure out your requirements, and then try to understand how much time you want to spend on the operational aspects of it. Increasingly, fully managed services are being picked because of the lower operational costs and the faster time to insight from a data perspective.

John Santaferraro | Industry Analyst 

Great. So the other thing I want to leave you with as an audience is some considerations for any unified analytics decision. There are eight areas here to drill down into. I’m not going to go deep into these, but I want to provide this for you so you can be thinking about eight areas of consideration as you’re choosing a unified analytics solution.

From a data perspective, what is the breadth of data that can be covered by this particular approach to unified analytics moving forward? Look at support for a broad range of different types of analytics, not just SQL but Python, notebooks, search, anything that enhances your analytic capabilities and broadens them. You want to make sure that your solution supports a broad set of users on a single platform, everybody from the engineer to the business, and the analyst and the scientist in between. It’s got to be cloud; in my opinion, cloud is the future. Does the platform support enterprise requirements and all of the business requirements? Is it cost efficient from a business perspective? And then, drilling down into the cloud, look at things like elasticity, which is automation; scalability; and mobility, because everything’s going mobile and [inaudible]; am I able to do this as I expand to new regions?

In terms of drilling down on the enterprise, look at security, privacy, and governance. Unification for the business: does it support business semantics for my organization and the logic that I want to include, either in the product or in a layer above? In some cases, that’s going to be through partners. Is it going to allow me to create measurable value for my organization and optimize, creating more value over time? And then finally, in terms of costs, is it going to allow me to forecast my costs accurately and contain costs over time? Look at things like chargeback and cost at scale. As this thing grows, and for anybody that’s doing analytics, that analytics program is growing.

So it’s got to be able to scale without just multiplying and creating incremental costs as you grow.

Dipti Borkar | Ahana 

One more thing I would add to costs, John, is the starting cost, the initial cost to even try it out and get started. This is important, because even the way platform teams evaluate products and technologies is changing. They want the ability to have a pay as you go model. We’re seeing that be quite useful for them, because sometimes you don’t know until you’ve tried it out for a period of time.

What cloud is enabling is also a pay as you go model. So initially you only pay for what you use; it’s a consumption based model. It might be compute hours, it might be storage, whatever; different vendors might do it in different ways, but that is important. Make sure you have that option, because it will give you the flexibility to try things out in parallel, and you don’t have to have an exorbitant starting cost for trying out a technology. The cloud is now allowing you to actually have that option available.

John Santaferraro | Industry Analyst 

Yeah, yeah. Good point, Dipti. So I had the privilege of interviewing Uber, both a user and a developer of Presto, and what an incredible story. I was blown away. First of all, the hyperscale of analytics. Analytics is core to everything that Uber does. The hyperscale was amazing, 10,000 cities, and I’m just going to say it all, even though it’s right there in front of you to read, because it’s amazing. 18+ million trips every single day. They now have 256 petabytes of data, and they’re adding 35 petabytes of new data every day. They’re going to go to exabytes. They have 12,000 monthly active users of analytics running more than 400,000 queries every single day. And all of that is running on Presto. They have all the enterprise readiness capabilities: automation, workload management, running complex queries, security. It’s an amazing story. Dipti, I mean, you know this story well. What stands out to you about Uber, not just their use of Presto, but their development of it as well?

Dipti Borkar | Ahana 

Yeah, absolutely. I mean, it’s an incredible story. And there are many, many other incredible stories like this, where, you know, Presto is being used at scale. If we refer back to your chart earlier, where we looked at scale and where the data lake fits in versus where the data warehouse fits in, you probably would not be able to do this with a data warehouse. In fact, they migrated off a data warehouse; it was, I think, Vertica or something like that, to Presto. They’ve completed that migration. And not just that, they have other databases that sit next to Presto that Presto also queries.

So, you know, this is as perfect a use case for the unified analytics slide that you presented earlier, because not only is it running on a data lake with petabytes and petabytes of information, it’s also actually abstracting and unifying across a couple of different systems. And Presto is being used for both. It is the de facto query engine for the lake, and it helps in some cases where you need to do a join or correlation across a couple of different databases. The other thing I’d add here is that not everybody is at Uber scale.

How many internet companies are there? But what we’re seeing is that users and platform teams throw away a lot of data, and don’t store it, because of the cost implications of warehouses. The traditional warehouses, and also the cloud warehouses, may mean double the cost, because you have the data in your lake, but you also have to ingest it into another warehouse. So you’re duplicating the storage cost, and you’re paying quite a bit more for your warehouse. And so instead of throwing away the data because it’s cost prohibitive, that’s where the lake helps. Store it in S3; you don’t have to throw compute at it today.

But tomorrow, let’s say that data starts to become more interesting. You can very easily convert it to Parquet or another format (Presto can query JSON and many different formats) and query it with Presto on top, from an analytics perspective, and correlate it with other data that you have in S3. So I would say that instead of aggregating away and losing data, remember that data is an asset, and most businesses are thinking about it in that way. It isn’t on your balance sheet yet, but there will be a time when you actually weigh the importance of the data you have.

If you have the ability to actually store all this data now, because it is cheap (you can use Glacier storage, S3; AWS has really great [inaudible] where you have many different tiers of storage possible), that’s a starting point. That way, you have the option of building a very powerful lake on top of that data, if and when you choose to. So just a few thoughts on that.

John Santaferraro | Industry Analyst 

Yeah, I think the other thing I was impressed with, and I think this is relevant to any size company, is the breadth of use cases that they’re able to run on Presto. They’re doing their ETL, data science, exploration, OLAP, and federated queries all on this single platform. They really are contributing back to the Presto open-source code: pushing real-time capabilities with the connection to Pinot, sampling (being able to run queries on a sample of data automatically), and more optimizations to increase performance. And you probably are intimately involved in the open-source projects that are listed here as well.

So, I think it bodes well for the future of Presto and for the future of Ahana.

Dipti Borkar | Ahana 

Yeah, it’s incredible to be partnering in a community driven project. There are many projects; Presto is a part of the Linux Foundation, and so it’s a community driven project. Facebook, Uber, Twitter, and Alibaba founded it, and Ahana is a very early member of the project.

We contribute back, and we work together. Project Aria, for example, which you see here, came out of Facebook for optimizing ORC. We are working on Aria for Parquet; Parquet is a popular format that Uber can use, Facebook can use, and other users can use as well. There are other projects as well, for example the multiple coordinator project. Presto initially had just one coordinator, and now there’s an alpha available where you have multiple coordinators, which extends the scale even further for Presto. It reduces the scheduling limitations; we were already talking about thousands of nodes, but in case you need more, it can go even beyond. But these are important.

These are important innovations. The performance dimension and the scale dimension tend to be driven by Facebook and Uber, and we are also working on some performance. But the enterprise aspects like security, governance, high availability, and cloud readiness are aspects that Ahana is focused on and bringing to the community as well. And we’re excited to see the second half; we have a second half roadmap for Presto, and we’re excited to see how that comes along.

John Santaferraro | Industry Analyst 

Awesome. So, we started this session by talking about the complexity of Hadoop and open source when it was first launched. And quite frankly, nobody wants to manage 1,000 nodes of Presto, unless you’re Ahana, maybe? So let’s talk about Ahana. What have you guys done to simplify the use of Presto and make it immediately available for anybody who wants to use it? What’s going on with Ahana?

Dipti Borkar | Ahana 

Yeah, absolutely. And maybe, Ali, if I can share a couple of slides, I’ll bring up what that looks like in a minute. Okay. John, do you see the screen? Alright? Yes, I do. Okay, great. So Ahana is essentially a SaaS platform for Presto. We’ve built it to be fully integrated, it’s cloud native, and it’s a fully managed service that gives you the best of both worlds. It gives you the ability to have visibility into your clusters, the number of nodes, and things like that, but it’s also built to be very, very easy, so that you don’t have to worry about installing, configuring, tuning, and a variety of things.

How it works is pretty straightforward. You go in and you sign up for Ahana, you create an account. The next thing that happens is we create a compute plane in the user’s account, in your account, and we set up the environment for you. This is a one time thing; it takes about 20 to 30 minutes. This is bringing up your Kubernetes cluster, setting up your VPC, and your entire environment, all the way from networking at the top to the operating system below. From that point, you’re ready to create any number of Presto clusters. It’s a single pane of glass that allows you to create different clusters for different purposes.

You might have an interactive workload for one cluster, you might have a transformation workload for another cluster, and you can scale them independently and manage them independently. So it’s really, really straightforward and easy to get started. All of this is also available through the AWS Marketplace. We’re an AWS first company, and the product is available pay as you go. So we only charge for the Presto usage that you might have, on an hourly basis. And so that’s really kind of how it works.

At a high level, just to summarize some of the important aspects of the platform, one of the key decisions we made is: do you bring data to compute, or do you take compute and move it to data? I thought about it from a user perspective. This was an important design decision we made. Increasingly, data is very valuable, as I said earlier, incredibly valuable, and users don’t want to move it out of their environment. Snowflake and other data warehouses are doing incredibly well, but if users had a choice, they would keep the data in their own environment. What we’ve done is take anything that touches data, the Presto clusters, the Hive metastore, even Superset (we have an instance of Superset that provides an admin console for Ahana), and run all of these things in the user’s environment, in the user’s VPC. None of this information ever crosses over to the Ahana SaaS. And that’s very important.

From a governance perspective, there are increasingly a lot of GDPR requirements and so on. That’s the way it’s designed at a high level. Of course, as you mentioned, John, we connect to the data lake; that’s our primary path. 80% of the workloads we see are for S3, but 5% to 10% might be for some of the other data sources. You can federate across RDS, MySQL, the Redshift data warehouse, Elastic, and others, for example. And we have first class integrations with Glue. Again, very, very easy to integrate: you can bring your own catalog, or you can have one created with a click of a button in Ahana.

You can bring your own tools on top; it’s standard JDBC and ODBC. As you said, SQL is the lingua franca, and it’s ANSI SQL. Presto is ANSI SQL. And so that makes it very easy to get started with any tool on top, and to integrate it into your environment.

So that’s a little bit about Ahana. And I think that might bring us to the end of our discussion here.

Ali LeClerc | Ahana 

Great. Well, thank you, Dipti and John. What a fantastic discussion, and I hope everybody got a good overview of data lakes, data warehouses, what’s going on in the market, and how to make a decision on which way to go. So, we have a bunch of questions. I don’t think we’re going to have enough time to get to all of them, so I’m going to ask some of the more popular ones that have kept popping up.

So first is around Presto. Dipti, probably for you, is Presto a means of data virtualization?

Dipti Borkar | Ahana 

Yeah, so that’s a good question. Presto was built as a data lake engine, but given its pluggable architecture, it is also able to support other sources. Virtualization, I would say, is an overloaded term; it means many things. But if it means accessing different data sources, then yes, Presto is capable of doing that, like you just saw in my last slide.

Ali LeClerc | Ahana 

Great. And by the way, folks, we do have another webinar. This is the first webinar in our series. Next week, we’ll be going into more detail on how you can actually do SQL on the data lake. Highly recommend if you’re interested in learning more, going a bit deeper, checking that out. I dropped the link to register in the chat box. So, feel free to do that.

So, a question, I think, for both of you. Dipti, earlier you touched on this idea of augmenting the data warehouse versus perhaps skipping the data warehouse altogether. And Dipti and John, I think you both bring a different perspective to that. What are you seeing in the market? Are people facing that decision? Is it leaning one way or the other? What’s going on around augmenting versus skipping?

John Santaferraro | Industry Analyst 

One of the trends that I’m seeing is that when data originates in the cloud, it tends to stay in the cloud, and it tends to move to a modern architecture. So in truly digital instances, rather than taking digital data and trying to get it back into a legacy or traditional data warehouse, organizations are almost always putting it into a data lake and into, you know, what I love that you term, Dipti, the Open Data Warehouse, using those formats.

That said, people continue to migrate to the cloud. When I was at EMA, we saw that approximately 53% of data was already in the cloud. But that means 47% of the data is still on premises. And so, if the data is already there and in a database, that migration may or may not make sense. You have to weigh the value, and oftentimes the value is having it all in a single unified analytics warehouse.

Dipti Borkar | Ahana 

Right. Yeah, what I would say is that I think it depends on cloud or on prem. Most of our discussion has been about the cloud, because we are forward looking people, forward thinkers. But the truth is, there really is a lot of data on prem. On prem, what we’re seeing is that it will almost always be augment.

Because most folks will have a warehouse, whether it’s Vertica or Teradata, or DB2 or Oracle, whichever it is, and they might have an HDFS, kind of a Hadoop system, on the side. That would be augment; that’s more traditional. In the cloud, I think we’re seeing both. We’re seeing that users who have been on the warehouse are choosing to augment and not just migrate off completely. And I think that is the right thing to do; you do want to have a period of time, and when I say period, it’s years of time. If you have a very mature warehouse, it will take some time to migrate that workload over to the lake. And so new workloads will be on the lake, and old workloads will slowly migrate off. So that’s the way we see it; it’s really augment for a period of time.

You know, I often joke that mainframes are still around. So warehouses aren’t going anywhere, and that’s the argument. Now, the pre-warehouse users, the ones who don’t have a warehouse yet, are choosing to skip it, and I would say that percentage will continue to increase. I’m seeing that about 20-30% are choosing to skip the warehouse. That will only increase as more capabilities get built on the lake. Transactionality is very early right now. Governance is just starting to get to column-level and row-level filtering, masking, and so on. So there’s some work to be done.

We have our work cut out for us on the lake. I see it as a three-to-five-year period, where this will start moving and more and more users will end up skipping the warehouse and moving to the lake. But today, it depends on the use cases; for the simple use cases, we are seeing about 20-30% going directly to the lake.

Ali LeClerc | Ahana

Wonderful. So, with that, I think we are over time now. We appreciate everybody who stuck around and stayed a few minutes past the hour. We hope that you enjoyed the topic. John, Dipti, what a fantastic conversation. Thanks for sharing your insights into this topic. So, with that, everybody, thank you. Thanks for staying with us; we hope to see you next week and see you next time. Thank you.

Speakers

John Santaferraro
Industry Analyst

Dipti Borkar
Cofounder & CPO, Ahana

Tutorial: How to run SQL queries with Presto on Amazon Redshift

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on AWS Redshift.

Presto’s Redshift connector allows querying the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters. 

Step 1: Set up a Presto cluster with Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions, or you can use Ahana’s managed service for Presto.

Step 2: Set up an Amazon Redshift cluster

Create an Amazon Redshift cluster from AWS Console and make sure it’s up and running with dataset and tables as described here.

The screen below shows the Amazon Redshift cluster “redshift-presto-demo”.

Further, the JDBC URL from the cluster is required to set up the Redshift connector with Presto.

You can skip this section if you want to use your existing Redshift cluster; just make sure your Redshift cluster is accessible from Presto, because AWS services are secure by default. Even if you have created your Amazon Redshift cluster in a public VPC, the security group assigned to the target Redshift cluster can prevent inbound connections to the database cluster. In simple terms, the security group settings of the Redshift database act as a firewall and can block inbound database connections over port 5439. Find the assigned security group and check its inbound rules.

If your Presto compute plane VPC and your data sources are in different VPCs, then you need to configure a VPC peering connection.

Step 3: Configure Presto Catalog for Amazon Redshift Connector

At Ahana we have simplified this experience and you can do this step in a few minutes as explained in these instructions.

Essentially, to configure the Redshift connector, create a catalog properties file in etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

connection-password=secret
connector.name=redshift
connection-url=jdbc:postgresql://example.net:5439/database
connection-user=root

This is what my catalog properties look like:

  my_redshift.properties: |
      connector.name=redshift
      connection-user=awsuser
      connection-password=admin1234
      connection-url=jdbc:postgresql://redshift-presto-demo.us.redshift.amazonaws.com:5439/dev

Step 4: Check for available datasets, schemas, and tables, and run SQL queries with the Presto client to access the Redshift database

After successfully connecting to the Amazon Redshift database, you can connect to the Presto CLI and run the following queries. Make sure that the Redshift catalog gets picked up, and run show schemas and show tables to understand the available data.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog my_redshift --schema <schema_name> --user <presto_username> --password

In the example below you can see that a new catalog for the Redshift database, called “my_redshift”, has been initialized.

presto> show catalogs;
   Catalog   
-------------
 ahana_hive  
 jmx         
 my_redshift 
 system      
 tpcds       
 tpch        
(6 rows)
 
Query 20210810_173543_00209_krtkp, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Further, you can check all available schemas in your Amazon Redshift database from Presto.

presto> show schemas from my_redshift;
       Schema       
--------------------
 catalog_history    
 information_schema 
 pg_catalog         
 pg_internal        
 public             
(5 rows)
 
Query 20210810_174048_00210_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:01 [5 rows, 85B] [4 rows/s, 72B/s]

Here, I have used the sample data that comes with the Redshift cluster setup. I have chosen the schema “public”, which is part of the “dev” Redshift database.

presto> show tables from my_redshift.public;
  Table   
----------
 category 
 date     
 event    
 listing  
 sales    
 users    
 venue    
(7 rows)
 
Query 20210810_185448_00211_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:03 [7 rows, 151B] [2 rows/s, 56B/s]

Further, you can explore tables, such as “sales” in the example below.

presto> select * from my_redshift.public.sales LIMIT 2;
 salesid | listid | sellerid | buyerid | eventid | dateid | qtysold | pricepaid | commission |        saletime         
---------+--------+----------+---------+---------+--------+---------+-----------+------------+-------------------------
   33095 |  36572 |    30047 |     660 |    2903 |   1827 |       2 | 234.00    | 35.10      | 2008-01-01 01:41:06.000 
   88268 | 100813 |    45818 |     698 |    8649 |   1827 |       4 | 836.00    | 125.40     | 2007-12-31 23:26:20.000 
(2 rows)
 
Query 20210810_185527_00212_krtkp, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:03 [18.1K rows, 0B] [6.58K rows/s, 0B/s]

Following are some more complex queries you can run against sample data:

presto:public> -- Find top 10 buyers by quantity
            ->SELECT firstname, lastname, total_quantity 
            -> FROM   (SELECT buyerid, sum(qtysold) total_quantity
            ->         FROM  sales
            ->         GROUP BY buyerid
            ->         ORDER BY total_quantity desc limit 10) Q, users
            -> WHERE Q.buyerid = userid
            -> ORDER BY Q.total_quantity desc;
 firstname | lastname | total_quantity 
-----------+----------+----------------
 Jerry     | Nichols  |             67 
 Armando   | Lopez    |             64 
 Kameko    | Bowman   |             64 
 Kellie    | Savage   |             63 
 Belle     | Foreman  |             60 
 Penelope  | Merritt  |             60 
 Kadeem    | Blair    |             60 
 Rhona     | Sweet    |             60 
 Deborah   | Barber   |             60 
 Herrod    | Sparks   |             60 
(10 rows)
 
Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes
Splits: 214 total, 214 done (100.00%)
0:10 [222K rows, 0B] [22.4K rows/s, 0B/s]
 
presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales.
            -> SELECT eventname, total_price 
            -> FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile 
            ->        FROM (SELECT eventid, sum(pricepaid) total_price
            ->              FROM   sales
            ->              GROUP BY eventid)) Q, event E
            ->        WHERE Q.eventid = E.eventid
            ->        AND percentile = 1
            -> ORDER BY total_price desc;
      eventname       | total_price 
----------------------+-------------
 Adriana Lecouvreur   | 51846.00    
 Janet Jackson        | 51049.00    
 Phantom of the Opera | 50301.00    
 The Little Mermaid   | 49956.00    
 Citizen Cope         | 49823.00    
 Sevendust            | 48020.00    
 Electra              | 47883.00    
 Mary Poppins         | 46780.00    
 Live                 | 46661.00    
(9 rows)
 
Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes
Splits: 230 total, 230 done (100.00%)
0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]
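As one more illustration (a sketch only, assuming the standard TICKIT sample schema, where the date table carries a caldate column; the table name is quoted because date is a reserved word), you can also aggregate gross sales by calendar month:

-- Sketch: assumes the TICKIT sample tables shown above; "date" is quoted because it is a reserved word.
SELECT date_trunc('month', d.caldate) AS sales_month,
       sum(s.pricepaid) AS gross_sales
FROM my_redshift.public.sales s
JOIN my_redshift.public."date" d ON s.dateid = d.dateid
GROUP BY 1
ORDER BY 1;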

Step 5: Run SQL query to join data between different systems like Redshift and Hive

Another great use case of Presto is data federation. In this example I will join an Apache Hive table with an Amazon Redshift table and run a JOIN query to access both tables from Presto.

Here, I have two catalogs, “ahana_hive” for the Hive database and “my_redshift” for Amazon Redshift, which contain the ahana_hive.default.customer and my_redshift.public.users tables respectively within their schemas.

The following is a very simple query to join these tables, the same way you would join two tables from the same database.

presto> show catalogs;
presto> select * from ahana_hive.default.customer;
presto> select * from my_redshift.public.users;
presto> Select * from ahana_hive.default.customer x  join my_redshift.public.users y on x.nationkey = y.userid;
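As a slightly richer sketch (assuming the standard TICKIT users table, which includes a state column, and the same Hive customer table as above), a federated query can also aggregate across the two systems in one statement:

-- Sketch: assumes the TICKIT users table (with a state column) and the Hive customer table above.
-- Counts Hive customers per Redshift user state, joining across both catalogs.
SELECT y.state, count(*) AS customers
FROM ahana_hive.default.customer x
JOIN my_redshift.public.users y ON x.nationkey = y.userid
GROUP BY y.state
ORDER BY customers DESC
LIMIT 10;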

At Ahana, we have made it very simple and user friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

How do I use the approx_percentile function in Presto?

The Presto approx_percentile function is one of the approximate aggregate functions, and it returns an approximate percentile for a set of values (e.g. a column). In this short article, we will explain how to use the approx_percentile function.

What is a percentile?

From Wikipedia:

In statistics, a percentile (or a centile) is a score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition)

To apply this, we’ll walk through an example with data points from a known, and arguably the most famous, distribution: the Normal (or Gaussian) distribution. The adjacent diagram plots the density of a Normal distribution with a mean of 100 and standard deviation of 10. If we were to sample data points from this Normal distribution, we know that approximately half of the data points would be less than the mean and half of the data points would be above the mean. Hence, the mean, or 100 in this case, would be the 50th percentile for the data. It turns out that the 90th percentile would be approximately 112.82; this means that 90% of the data points are less than 112.82.

approx_percentile by example

To solidify our understanding of percentiles and the approx_percentile function, we’ve created a few tables to use as examples:

presto:default> show tables;
    Table
-------------
 dummy
 norm_0_1
 norm_100_10
 norm_all
(4 rows)
 Table       | Description                                                                        | Number of Rows
-------------+------------------------------------------------------------------------------------+---------------
 dummy       | Single column, 100-row table of all ones except for a single value of 100.        | 100
 norm_0_1    | Samples from a normal distribution with mean of 0 and standard deviation of 1.    | 5000
 norm_100_10 | Samples from a normal distribution with mean of 100 and standard deviation of 10. | 5000
 norm_all    | Coalescence of all normal distribution tables.                                     | 10000

Table 1

The approx_percentile function has eight type signatures. You are encouraged to review the Presto public documentation for all the function variants and official descriptions. The set of values (e.g. column) is a required parameter and is always the first argument.

Another required parameter is the percentage parameter, which indicates the percentage or percentages for the returned approximate percentile. The percentage(s) must be specified as a number between zero and one. The percentage parameter can either be the second or third argument of the function, depending on the intended signature. In the following examples, the percentage parameter will be the second argument. For example, approx_percentile(x,0.5) will return the approximate percentile for column x at 50%. For data points in our norm_100_10 table, we expect the returned value to be around 100.

presto:default> select approx_percentile(x,0.5) from norm_100_10;
      _col0
------------------
 99.8184647799587
(1 row)

approx_percentile(x,0.9) will return the approximate percentile for column x at 90%, which for the data in norm_100_10 table should be around 112.82.

presto:default> select approx_percentile(x,0.9) from norm_100_10;
      _col0
------------------
 112.692881777202
(1 row)
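To sanity check the approximation, you could also compute an empirical percentile directly with a window function (a sketch, assuming the same norm_100_10 table); the smallest x whose cumulative distribution reaches 90% should land close to the approx_percentile result above.

-- Sketch: assumes the same norm_100_10 table; empirical 90th percentile via cume_dist().
SELECT min(x)
FROM (
  SELECT x, cume_dist() OVER (ORDER BY x) AS cd
  FROM norm_100_10
) t
WHERE cd >= 0.9;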

In a single query, you can also specify an array of percentages to compute percentiles: approx_percentile(x,ARRAY[0.5, 0.9]).

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_100_10;
                _col0
--------------------------------------
 [99.8184647799587, 112.692881777202]
(1 row)

We can ask for multiple percentages for our dummy table, which consists of 100 rows of all ones except for a single value of 100. Hence, we expect all percentiles below 99% to be 1.

presto:default> select approx_percentile(x,ARRAY[0.1, 0.5, 0.98, 0.99]) from dummy;
         _col0
------------------------
 [1.0, 1.0, 1.0, 100.0]
(1 row)

We can also use a GROUP BY clause to segment the values to compute percentiles over. To illustrate this, we will use our norm_all table, which contains values from both the norm_100_10 and the norm_0_1 tables. The m and sd columns specify the mean and standard deviation of the normal distribution the corresponding x value is sampled from.

presto:default> select m, sd, x from norm_all order by rand() limit 10;
  m  | sd |         x
-----+----+--------------------
 0   | 1  | -0.540796486700647
 0   | 1  |   0.81148151337731
 0   | 1  |   1.28976310661005
 100 | 10 |   97.0272872801269
 100 | 10 |   83.1392343835652
 0   | 1  | -0.585678877703149
 0   | 1  |  0.268589447255106
 0   | 1  | -0.280908719376113
 100 | 10 |    104.36328077332
 0   | 1  |  0.266294347905949
(10 rows)

The following query then will return the approximate percentiles for 50% and 90% for data points grouped by the same values of m and sd (i.e. from the same normal distribution): select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd)). As expected, we see that the approximate 50% and 90% percentiles are around 100 and 112.82 for a mean of 100 and standard deviation of 10, and around 0 and 1.28 for a mean of 0 and standard deviation of 1.

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd));
                  _col0
------------------------------------------
 [99.8563616481321, 112.879972343696]
 [-0.00458419083839064, 1.30949677294588]
(2 rows)

An optional parameter is accuracy, which controls the maximum rank error and defaults to 0.01. The value of accuracy must be between zero and one (exclusive) and must be constant for all input rows. We can add accuracy as the third argument to our function. For example, approx_percentile(x,0.9,0.5) will return the approximate percentile for column x at 90% with 0.5 maximum rank error accuracy. By allowing for larger error (from the 0.01 default), we can see the approximate percentile of 113.50 is farther away from our true value of 112.82 than our previous result of 112.88.

presto:default> select approx_percentile(x,0.9,0.5) from norm_100_10;
       _col0
--------------------
 113.49999999999999
(1 row)
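Going the other direction, tightening accuracy below the 0.01 default should pull the estimate closer to the true percentile, generally at the cost of more work for the underlying estimator (a sketch; the exact output depends on the sampled data, so it is omitted here):

-- Sketch: tighter 0.005 maximum rank error; output omitted since it depends on the sampled data.
select approx_percentile(x, 0.9, 0.005) from norm_100_10;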

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Presto Company Ahana Raises $20M Series A Led By Third Point Ventures To Redefine Open Data Lake Analytics

Funding comes on heels of major momentum in customer and community adoption for Presto

San Mateo, Calif. – August 3, 2021 — Ahana, the SaaS for Presto company, today announced it has raised $20 million in Series A funding to transform open data lake analytics, bringing total funds raised to $24.8 million. The financing round was led by Third Point Ventures and included existing investors GV (formerly Google Ventures), Leslie Ventures, and Lux Capital. Robert Schwartz, Managing Partner, Third Point Ventures, will join Ahana’s Board of Directors. 

Today more companies are augmenting the traditional cloud data warehouse with cloud data lakes like AWS S3 due to its affordability and flexibility, allowing for the storage of a lot more data in different formats. But analyzing that data is challenging for data platform teams. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. With this investment, Ahana will continue to transform the open data lake analytics market with the only Presto SaaS by further accelerating engineering and contributions to the open source community, as well as expanding its go-to-market teams.

“We’re excited to join the exceptional team at Ahana and assist them in their evolution from rapid, early adoption to substantial market prominence. As we witness the evolution of modern analytics, we’re seeing a new stack emerge adjacent to the data warehouse. Companies need an open, flexible approach to access their data, and the data lake with Presto on top provides that,” said Robert Schwartz, Managing Partner, Third Point Ventures. “Ahana Cloud provides the foundation for this new stack, giving all platform teams the ability to easily use Presto on their data lake. With Ahana, any company can leverage the open data lake for their analytics. This is extremely powerful.”

Under the Linux Foundation’s Presto Foundation, the Presto open source project has seen massive growth just in the past six months including hundreds of thousands of pulls of the Docker Sandbox Container for Presto hosted by Ahana, over 1,000 members in global Presto meetups, and a total of ten companies that are now part of the Presto Foundation.

“From day one Ahana has focused on delivering the easiest Presto managed service for open data lake analytics in the cloud, giving data platform teams the ability to provide high performance SQL analytics on their S3 data lakes,” said Steven Mih, Cofounder and CEO, Ahana. “As more open source-based companies like Confluent and Neo4J see extreme momentum in today’s market, Ahana’s open source go-to-market coupled with its deep involvement with the Presto Foundation has positioned Ahana as the Presto company and leader in the open data lake analytics space.”

“Since its launch in June of 2020, Ahana has in a short time established itself as the Presto company, bringing a solution to market that enables any team to use Presto in the cloud for their data lake analytics,” said Mark Leslie, Managing Director, Leslie Ventures. “Couple that with the momentum we’re seeing in the Presto community, I look forward to even more from the Ahana team as they execute on their vision of the open data lake analytics stack with Presto.”

TWEET THIS: @Ahana raises $20M Series A to redefine open data lake analytics led by @ThirdPointVC #Presto #OpenSource #Analytics #Cloud https://bit.ly/3fa5HsO 

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Leslie Ventures, Lux Capital, and Third Point Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Announcing the Ahana $20M Series A – Furthering our Vision of Open Data Lake Analytics with Presto

I’m very excited to announce that Ahana, the SaaS for Presto company, has raised a jumbo $20M Series A round from lead investor Third Point Ventures. Our SaaS managed service for open source Presto enables data platform teams to easily and cost effectively deliver powerful SQL processing on all their data. This is our vision of Open Data Lake Analytics and it’s what Facebook, Uber, Twitter, and others in the Presto Foundation have been running at scale.

It’s been only 15 months since the founding of Ahana and I’m extremely proud of what the team has achieved to date. We came out of stealth last year with seed investments totalling $4.8M from GV (Google Ventures), Lux Capital, and Leslie Ventures, who all also participated in this oversubscribed A round. Initially we focused on the Presto community, providing support, open source contributions, and tutorials to help people get started easily with Presto. Subsequently, we announced the first managed service for Presto at PrestoCon in September and GA’d two months later at AWS Re:Invent in December with early customer acclaim. 

I am also excited to have Rob Schwartz, Managing Partner of Third Point Ventures, join our board and to partner with Third Point, which has tremendous public market investing expertise, relationships, and research capabilities. Third Point has over $25B under management, and Rob drives their early stage, growth, and cross-over investments in their emerging technology arm, Third Point Ventures. I value his hands-on experience with young companies, helping them deliver on their vision. For example, he is an active investor in Yellowbrick in the data space. Rob and Third Point had four key reasons to invest in Ahana: the team, the project, the market, and the product-market fit as evidenced by customer traction. Let me touch on these four areas:

  1. THE TEAM: This is my 6th startup tour of duty and I’ll be the first to attest that startups require an incredible amount of energy. That energy comes from the synchronized rowing of the oars to propel the ship, no matter the size (or startup stage of company growth). I’m most proud of our extraordinary team; all their hands on deck that have pulled together so far. The Ahana team includes experts in a range of industry-leading databases (Teradata, IBM DB2, Vertica, Aster Data, and the recently IPO’d Couchbase), in addition to experts in open source and Presto hailing from Facebook, Uber, Walmart, and Alibaba. 
  2. THE OPEN SOURCE PRESTO PROJECT: Since we started Ahana last year, the momentum we’ve seen in the Presto community has been phenomenal. We helped the Presto Foundation lead two massively successful PrestoCons and numerous meetups across the world, whose membership has crossed 1,000 members. The Docker Sandbox Container for Presto hosted by Ahana has had over 250K pulls just in the last 6 months, and 10 companies are now part of the growing Presto Foundation consortium. We’re both thankful to all those in the community who have helped make Presto what it is and humbled to be a part of its success. We pledge to continue open source code contributions for the benefit of the Presto community for many years to come.
  3. THE MARKET: This $20MM Series A raise enables us to further our vision of providing the most flexible and performant Open Data Lake Analytics with Presto. Open Data Lake Analytics is quickly becoming the next big emerging stack, led by hyperscaler bellwether companies like Facebook and Uber. While the data warehouse has been the workhorse of analytics, it’s also very expensive. To address that, more data is moving into cloud data lakes like AWS S3 because they are so inexpensive and ubiquitous. So we’re seeing the data warehouse getting deconstructed in the cloud, with the commodified data lake coupled with open source Presto. This stack enables SQL analytics directly on the data lake, making it the most cost effective, open, and flexible solution.
  4. PRODUCT-MARKET FIT: Ahana is at the intersection of some rapidly growing trends right now – Open Source, Cloud Analytics, and the Data Lake. And our customer base is proof of that. Having only GA’d this past December, the adoption we’ve seen has been incredible. Companies like Cartona, an eCommerce company out of Egypt, and Carbon, a fast growing ad tech company, are using Ahana and telling the world about their use cases. Securonix, the Gartner magic quadrant leader in the security SIEM space, is a huge proponent of Presto and Ahana and recently joined us at the AWS Startup Showcase. Our customers are building out reporting and dashboarding, customer-facing analytics, data transformation use cases, and much more with Ahana Cloud for Presto. We can’t wait to see what they do next.

Lastly, I’ll mention that this raise enables us to accelerate growth in three main areas: 

  1. Technical innovation of the Presto project, by scaling our product and engineering teams. Btw, we’re an all remote company. 
  2. Adoption via more evangelism for the Presto open source project. We will continue working closely with the community and other Presto Foundation members like Facebook, Uber, and Intel.
  3.  Growing our Marketing and Sales organizations, continuing our focus on customer adoption.

To sum it up, I’d like to share what Rob Schwartz, Managing Partner of Third Point Ventures says about us:

“We’re excited to join the exceptional team at Ahana and assist them in their evolution from rapid, early adoption to substantial market prominence. As we witness the evolution of modern analytics, we’re seeing a new stack emerge adjacent to the data warehouse. Companies need an open, flexible approach to access their data, and the data lake with Presto on top provides that. Ahana Cloud provides the foundation for this new stack, giving all platform teams the ability to easily use Presto on their data lake. With Ahana, any company can leverage the open data lake for their analytics. This is extremely powerful.”

Cheers to our next phase of growth, and did I mention we’re hiring? 😉 

We are just getting started. Join us on this incredible journey at the intersection of cloud, data, and open source…what many unicorns are made of.

Autoscale your Presto cluster in Ahana Cloud

We’re excited to announce that autoscaling is now available on Ahana Cloud. In this initial release, the autoscaling feature monitors the average CPU utilization of your Presto worker nodes and scales out when it reaches the 75% threshold. Additionally, Presto clusters now have the ability to scale in to a minimum number of worker nodes when the cluster has been idle for a user-specified amount of time.

Never run out of memory with autoscaling

One of the challenges of running a Presto cluster is making the right decision about the number of worker nodes required to run your queries. Not all queries are equal, and predicting how many nodes will be required is not always possible. With the scale-out feature, the number of worker nodes increases based on CPU utilization to ensure that your queries can execute without running out of memory. That way you don’t have to worry about whether your deployment can support your requirements. Future iterations will include scale-in based on CPU utilization and autoscaling based on additional metrics.

Save cost with Idle time

When no queries are sent to a Presto cluster, it makes sense to reduce the number of worker nodes, but it’s not always practical to do so manually. With the idle time feature enabled, the system monitors query activity; if no activity is detected for a user-defined period of time, say 15 minutes, the number of worker nodes is reduced to its minimum count.

Two common use cases we found that benefit greatly from idle time cost saving are transformation workloads and ad hoc querying.

  • For transformation workloads, a query can potentially run for several hours, making it impractical to monitor its activity to decide when to manually stop the cluster or reduce the number of running nodes. Idle time cost saving waits for a certain period of inactivity and then automatically reduces the worker node count to the minimum until the next query hits the cluster.
  • For ad hoc querying, as the name suggests, the querying is not continuous, and scaling in to the minimum worker node count between queries helps reduce costs.

Enabling autoscaling

Getting started with autoscaling is easy with this step-by-step walkthrough.

Step 1 – In Cluster settings select Scale Out only (CPU) scaling strategy

Step 2 – Enter a Minimum and a Maximum worker node count, as well as a Scale Out step size. The scale-out step size decides how many nodes get added to the cluster when scale-out triggers.

Step 3 – By default, the cluster will resize to its minimum worker node count (defined above) after 30 minutes of inactivity; this can be set between 10 minutes and 1 hour.

Your new Presto cluster will scale out up to its maximum worker node count as long as the average CPU utilization of the worker nodes goes beyond 75%. However, if no queries reach the cluster for the default period of 30 minutes, then the cluster will reduce its worker node count to its minimum.

Enabling Idle time cost saving

Enabling Idle time cost saving is very easy with this step-by-step walkthrough.
As shown in the section above, idle time cost saving is enabled by default in the Scale Out only (CPU) scaling strategy.

For the Static cluster, to enable the feature, you will need to do the following:

Step 1 – Check Scale to a single worker node when idle

Step 2 – By default, the cluster will resize to a single worker node after 30 minutes of inactivity; this can be set between 10 minutes and 1 hour.

Changing the autoscaling configuration of an existing cluster

You can always change the configuration after a cluster has been created by following the steps below:

Step 1 – Navigate to the cluster details view

Step 2 – Edit the cluster scaling policy configuration

Step 3 – The server updates its configuration immediately after you click the Save button

What’s next?

Log in to the Ahana Cloud console to get started. You can also learn more about autoscaling by heading over to our documentation.

On-Demand Presentation

Community Roundtable: Open Data Lakes with Presto, Apache, Hudi & AWS S3

As we see more companies augment their traditional cloud data warehouses, and in some cases replace their data warehouses with cloud data lakes, a new stack has emerged that supports data warehouse workloads that weren’t possible on a data lake before, while bringing a lot more advantages like lower cost, flexibility, and no lock-in, with open formats and open interfaces.

The new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack

Unlike the cloud data warehouse which is closed source, has data stored in proprietary formats, tends to be very expensive, and assumes data needs to be ingested and integrated into one database to provide the critical business insights for decision-making, the PHAS3 stack is open, flexible, and affordable.

In this roundtable discussion, experts from each layer in this stack – Presto, AWS, and Apache Hudi – discuss why we’re seeing pronounced adoption of this next generation of cloud data lake analytics and how these technologies enable open, flexible, and highly performant analytics in the cloud.


Webinar Transcript:

SPEAKERS

Vinoth Chandar | Hudi, Roy Hasson | AWS, Dipti Borkar | Ahana, Eric Kavanaugh | Bloor Group

Eric Kavanaugh | Bloor Group

Ladies and gentlemen, hello and welcome to the Community Roundtable. Yours truly, Eric Kavanaugh, is here, frankly humbled to be with such experts. We’re going to talk about open data lakes with the new stack. And this is a great moniker, we’re actually using it ourselves for something slightly different. But they’re both along the same lines. The new stack refers to the new technology components that you can weave together to build out your enterprise computing platform. And what’s happening these days is absolutely amazing.

So folks, open data lakes with Presto, Apache Hudi, and AWS Glue, and of course S3: the next generation of analytics. We’re going to talk to my good friend Dipti Borkar, from Presto, also Vinoth Chandar, from Apache Hudi. And Roy Hasson, and there he is, he was the voice of the cloud a minute ago, now he’s visual. So AWS has materialized in our view.

We’re going to talk about what this all really means, folks. Just very quickly, I’m very excited and very bullish about what we’re seeing here. I’ve been tracking this industry for over 20 years. It was about 15 or 16 years ago that I interviewed a guy named Michael Stonebraker, Dr. Michael Stonebraker, who was talking about how one size does not fit all. Again, this is 2005, he was promoting something called Vertica. And he was basically saying that, look, the relational database has won in the enterprise for some reason, and he had his theories about that. But he said, that doesn’t make sense. There’s a need for purpose-built, specialized technologies for use cases that relational is not very good at. And so he was pushing Vertica back then, which of course is now part of HPE; it’s kind of bounced around a bit, but it’s a column-oriented database. Around the same time, I also started researching open source, and I remember distinctly kind of crystallizing thoughts in my head about the service oriented architecture and open source technology. And I thought to myself, this is some interesting handwriting on the wall, it’s going to impact the major players sooner or later, Oracle, IBM, SAP, etc. So these guys are going to have to wake up to what’s happening here, because SOA, if done properly, will enable the mixing and matching of component parts.

Well, fast forward 16 years, here we are today, talking about the new stack. What happened? Open source is a huge part of the equation here, folks; open source has recast the enterprise development world. What you’re going to see today from our guests – like I said, Vinoth Chandar from Apache Hudi, Roy Hasson from AWS, Dipti Borkar from Presto – is what this new stack really means. And it’s very exciting because we’ve basically taken the old database, which was great, taken that as a microcosm and built a macrocosm out of it. So now you have different component parts like Hudi, like AWS Glue, like S3, like Presto, for example, and what the folks at Presto and Ahana are bringing to the table as well.

This is a new stack, a new way of doing things. And of course, we saw with Snowflake’s IPO that data warehousing is alive and well. But that’s sort of a closed-circuit way of going about it, right? You’re trapped inside of Snowflake, you have to pay them for compute. They’re very clever about separating compute and storage, and they’re very clever about spinning up warehouses and taking them down. But nonetheless, it’s still a closed environment. We’re going to talk about the open environment today.

So let’s go around the room and introduce our guests. I’ll ask them to introduce themselves Dipti Borkar, I’ll throw it over to you first, tell us a bit about yourself and what Presto is.

Dipti Borkar | Ahana 

Yeah, hello, everyone, and great to be here with Vinoth, Roy and Eric; I’ve worked and interacted with all of you on various different projects. Looking forward to this discussion. I’m the co-founder and the Chief Product Officer at Ahana and also chair of the Presto Foundation Community team, and I’ve been in open source for over 10 years.

You know, you talked about the range of databases. I spent a lot of time on the relational database and warehouse side with distributed DB2 and the core storage and indexing kernel, then transitioned to NoSQL at Couchbase, many years there, building a range of different technologies, SQL on JSON. And fast forward a few years, we are seeing a new paradigm emerge with data lakes and building SQL for S3. So Presto, essentially, is a distributed query engine. It’s an open source engine created at Facebook and open sourced by them. It’s part of the Linux Foundation, under the Presto Foundation, and it’s built to be a great engine for data lakes, as well as other databases.

As you mentioned, there is polyglot persistence and many different options that people have, and you can also federate across them with Presto. Primarily, it’s being used on top of the data lake. You know, you mentioned Snowflake; this stack is really an open lake. You have an approach where we’re augmenting some of these data warehouses. I’m seeing a lot of different users, community users, customers move to this stack, where you have a query engine, you have a transaction manager layer in there, you have a data catalog, like AWS Glue. And then obviously, the cloud object storage, which is S3. So that’s a little bit about me and Presto.

Vinoth over to you.

Vinoth Chandar | Hudi

Yeah. Hey, my name is Vinoth and I’m the PMC chair for the Apache Hudi project at the ASF. And yeah, my background: I’ve done databases for a decade now, basically a one-trick pony. I started on databases at Oracle working on CDC – you know, Oracle GoldenGate, XStream, all of these products. Then I led the Voldemort key-value store at LinkedIn through the hyper-growth phases of LinkedIn. Then a brief stint at Box, where we were building kind of a Firebase replacement, and then I landed at Uber, where we created Apache Hudi. Since then, we’ve also been growing the project outside.

Most recently, I was at Confluent, where I was working on ksqlDB. And Hudi, in short, started as a transactional layer on top of Hadoop-file-system-compatible storage – you know, HDFS or S3, or object stores in general. It brought mutability to the data that you store on these object stores, plus indexing and all of the functionality that you need to build kind of an optimized data plane on top of object storage. That’s what Hudi provides.

Over the years, we also built a good set of platform components on top of this layer that complete the picture in terms of how you bring the data in – external data ingestion – and kind of self-management. Just like how databases have a lot of daemons optimizing things in the background for you, Hudi already comes with a runtime where all of this happens out of the box for you.

Eric Kavanagh | Bloor Group

And last but not least, Roy Hasson from AWS. Tell us a bit about yourself.

Roy Hasson | AWS 

Sure. Hi, everyone. Again, my name is Roy Hasson, product manager on the AWS Glue and AWS Lake Formation team. I’ve been with AWS for about five years, actually almost five and a half years, working with a lot of different customers on building these types of data lakes. I’ve also been heavily involved in the early launch of the Amazon Athena and AWS Glue services. So I’ve been in the weeds with a lot of customers, really trying to take this vision and implement it in a way that is scalable and meets their needs. Definitely learned a lot throughout these years. This is kind of the feedback that we’ve been pushing into our services to try to make them easier to use and better integrated. And we can talk about kind of what we’re doing.

I think generally speaking, when we talk about Glue, we’re really referring to the Glue data catalog here, where they provide a central metadata repository for everything that you need inside your data lake. So instead of having your data sort of inventoried in multiple catalogs, the glue catalog gives you a place to inventory your data, but also a way to access it. When we layer things like Lake Formation, on top of that, now we can add security and governance in our data lakes.

It’s really about cataloging the data that exists in your lake, but also making it accessible through different tools. Ahana, Athena, Redshift, etc, etc, in a central way, that is easier for users to find data and access it. So that’s kind of the gist of it. Looking forward to this conversation.

Eric Kavanagh | Bloor Group

Yeah, let’s go around the room and have you each describe the new stack. From your perspective, I kind of hinted at this in the opening, Dipti, I’ll throw it over to you, that what we’re seeing now is quite fascinating. It’s the open source community and the committers, who are involved, really creating these components that can then be brought together into a stack. And of course, in the old monolithic way of doing enterprise software, you would have that monolithic system, which would take care of things.

Like the database, for example, would do all sorts of different things, indexing caching into, you know, pre-preparing data, analytics, all this fun stuff could be done inside of a database. But then we realized that’s not very scalable. When you start dealing with the scale of the internet today and the business markets that are that are just growing by leaps and bounds, you can’t do things the old fashioned way.

And so now what we’re doing is developing each of these components as a scale-out unit itself. So you can scale out wherever you need: is it at the storage? Is it at the analytical capability? What is it that you need done? Each of these components is getting that done in its own scalable way. And I think that’s the real key – Dipti, is it scale?

Dipti Borkar | Ahana

Yeah. It’s interesting: when we were discussing and preparing for the session, I went back to the first blog I wrote at Ahana, which had this in there – you know, the holy grail of databases, right? When you start, you know, in a database class this is what you learn. You have the stack, the full stack, all of this for your clients: you have your query engine, which is your parser, compiler, optimizer, query rewrite, execution engine; and then you have the transaction and storage manager. You have a buffer pool, you have a lock manager, you have logging, and then there’s the catalog and other utilities.

What’s happened over the last few years – and in some ways, Hadoop kind of started it off, but it got extremely complicated with 70 different projects – is that we seem to have learned from some of those lessons, and now a cleaner stack is emerging. Presto, as an example, was built to be a database query engine, as opposed to MapReduce or Spark, which were more general-purpose computational engines. You have this box, which is now Presto; you have some of these parts, the log manager, some aspects of locking, some aspects of the buffer pool, which are Hudi, and there are others in there, like Delta and so on. And then you have the catalog, which manages the schema for the database and the schema for the tables – tables, columns, and everything we know about databases. So in some ways, the stack has been split apart. And it’s now coming together as one stack with Presto, Hudi, Glue and S3. We’re starting to see the popularity of this stack. We’re kind of calling it the PHAS3 stack – P, H, and AWS S3.

What we’re seeing is that there are a few reasons for this. Data lakes by themselves, which is S3, were immutable. They’re immutable, right? It’s an object store. So you can’t really do anything with just that data; you need the intelligence on top. You had query engines, like Presto, that came out that could query this data with a catalog, like a Hive metastore or Glue. But even then the data was immutable. So you really couldn’t do inserts, updates, deletes – they had to be done at a partition level, and there were some restrictions around this. So you couldn’t run the traditional data warehousing workloads on the data lake. That’s where, with the new stack and the emergence of some of these new layers, you are now seeing the ability for the first time to run very traditional data warehouse workloads on a lake, right. And that’s where the real value to end users comes in.

You get the flexibility of open formats like ORC and Parquet – these are open formats, so you’re not locked into a certain proprietary format. You have the ability to run other workloads on the same data without moving it around. And you get the scale. Because data is of different types, you have structured, semi-structured JSON, etc. And the newer query engines, like Presto, allow you to query a range of different types of data. So you can have objects that have JSON or CSV, or ORC and Parquet, obviously the more performant formats, and others. So you can query all of this in place and run other engines on top of that same lake. That’s the vision of this open data lake, where you get the best of the data warehouse, but you get the flexibility and the lower cost and the scale for the next 10-20 years.

That’s how I explain, at a high level, the transition over and where these pieces fit in. Back to you, Eric.
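
To make the query-in-place idea concrete, here is a hedged Presto SQL sketch against a Glue-backed Hive catalog. The catalog, schema, table, and column names (hive, weblogs, events, mysql.crm.customers) are hypothetical examples, not something discussed in the webinar.

-- Ad hoc aggregation directly over Parquet objects on S3 registered in Glue
SELECT country, count(*) AS sessions
FROM hive.weblogs.events
WHERE event_date = DATE '2021-06-01'
GROUP BY country
ORDER BY sessions DESC
LIMIT 10;

-- Federated join between the lake and another Presto connector (here a MySQL catalog)
SELECT c.segment, count(*) AS sessions
FROM hive.weblogs.events e
JOIN mysql.crm.customers c ON e.user_id = c.user_id
GROUP BY c.segment;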

Eric Kavanagh | Bloor Group

Yeah, and Vinoth, maybe you could share your thoughts too, because, again, what we’re seeing is a focus on each component part. You of course, are focused on Hudi these days that sort of transactional layer. Can you talk about what gets done in there and how we’re able to create greater elasticity?

Because we’ve decoupled these components, at least the hard decoupling, they’re loosely coupled now. That was also something from SOA, loosely coupling. Same principles are now being adopted at a different scale.

But tell us about your perspective, Vinoth.

Vinoth Chandar | Hudi

Yeah, that’s actually a very, very insightful question. This goes back even to the days when we were thinking about designing something like Hudi at Uber. So on day one, we had to support, like, three engines – Hive, Presto and Spark. When you look at how things were then, each engine was good and had its own use case, and we had to design for that. Even with respect to a catalog, we carefully designed Hudi as a transactional storage layer that can interact with something like Glue, or something like the Hive metastore, or some of these catalogs, in a very decoupled way. What this allowed us to do was horizontally scale, let’s say, the writing or the indexing capabilities, elastically. We can have 1,000 cores ingesting data while you can choose however many you want and pick an engine of choice for your queries. And this allowed for greater flexibility for us.

And it also, in my opinion, unlocked very different use cases that are probably not possible in the traditional [inaudible] as well. For example, if you were to want to lower the data latency in data lakes, with the horizontal scalability that you get from this kind of decoupled model, you’re easily able to throw, let’s say, more executors or cores at the writers to achieve the latency that you want. It’s very tunable. You don’t have to, like, downsize and resize a warehouse; you can just focus on sizing your warehouses based on a steady query workload. I think this model, in general, gave up long-running servers – and we can get into what we really lost from that – but we gained a lot more scalability and elasticity, in my opinion.

Eric Kavanagh | Bloor Group

Do you mind diving into that real quick? When you say “we gave up long-running servers, but we gained a lot more” – what do you mean by that exactly?

Vinoth Chandar | Hudi

Yeah. So if you look at this architecture, there are no long-running servers in the data plane. What I mean is, if you look at a data warehousing architecture – let’s say we take Vertica, or any cloud warehouse – there’s a bunch of servers which you do RPCs on, right? A query hits them, there’s a node which plans the query and distributes it within that warehouse cluster, and they’re able to, for example, cache a lot of metadata in memory, so it can be much faster to access metadata.

So in this architecture, these caches are sitting either with Hudi or within Presto – each layer is caching some parts of this. But long-running servers, which can do transaction management, can probably give you more features, like multi-table transactions, or, let’s say, the record-level locks that you find in traditional lock managers; all of those are implemented using some kind of in-memory locks. So we’ve given up some of these things.

But again, what the last four years of building this community out and supporting our use cases has shown us is that for analytical workloads, we probably don’t need them as much. That’s what we learned from that.

Eric Kavanagh | Bloor Group

Yeah, that’s great. That’s a really good insight, because it kind of explains what we’ve realized along the way. But to your point – and this is kind of what Stonebraker was saying 16 years ago – he said, look, the whole industry defaulted to a certain model, and he joked about why it happened; he took a dig at sales and marketing people. He’s like, “well, the marketing people couldn’t get the message straight. It was just easier to say it this way.” I thought that was kind of funny. But his point was that you have different use cases. And like you just said, sometimes you don’t need these older services that we’ve grown reliant upon, especially if they’re not required in a particular use case.

So you’re enabling the sort of heterogeneity of use of the platform, which can be great in terms of performance for all kinds of people.

But let’s turn it over to Roy Hasson from AWS. Can you walk us through Glue, how it has evolved, and how it’s fleshing out the stack?

Roy Hasson | AWS

Yeah, so I mean, Glue itself – and in particular the Glue Data Catalog – kind of started off as a Hive-compatible metastore that really tries to simplify the way that customers manage metadata in their environment. We had a lot of customers on premises, and even migrating to AWS, who were managing these Hive metastores on top of Amazon RDS or self-managed databases, and it’s a critical component of the entire solution. If you don’t have metadata, nothing is probably going to work, and it’s not going to work well. But managing those databases is just a pain. There’s no need to do it – you have to build replication, etc., etc. It’s not something you really want to do. So when we created the Glue Data Catalog, we basically came in and said, okay, it’s a critical component, we need to make sure that it works well, but we need to make it serverless so the customer doesn’t have to worry about it. And the integration was a really key aspect. We didn’t want to just create our own set of APIs and say, hey, everybody go ahead and integrate with it.

So we chose to be Hive compatible, in the sense that our APIs are very, very similar. So if you’re using Spark, or Presto, or whatever that may be, you can plug into the Glue catalog without a lot of development or a lot of complexity. I think that was the tipping point for the Glue Data Catalog – to say, now we can start plugging into more and more systems. So tools like Databricks, and Snowflake, and, you know, of course Ahana and lots of others are integrated with the Glue Data Catalog, which makes data access simpler. And I think the overall picture here – and Dipti and Vinoth kind of talked about this – is that these technologies are breaking the monolith and making things more decoupled, so we can have scale and performance and cost across the board.

But the one thing that we have to remember – and I really believe that this stack helps there – is how do we make it simpler, easier to use? Yes, there are a lot of moving parts. Yes, there are benefits to all of these things, so we don’t want to give that away. But we also don’t want to give away the ease of use that we need. And Hudi comes in and says: we’ve got the data – if stuff comes into the lake, we’ll manage it for you. We’ll update, we’ll insert, we’ll delete, we’ll do compaction, we’ll do all that stuff that in the past you’d have to build ETL processes for. I talk to customers all the time: I’ve got this massive ETL job that copies data to a staging directory and then copies it to the production directory; it’s a pain, customers have to stop and wait. Hudi just kind of does away with all of these things. And that’s the benefit of this stack: you get decoupling, you get scalability and performance, but you can say to the IT team, it’s covered. You don’t have to do anything – Hudi’s doing all this heavy lifting on the data side, the catalog is just managing all the metadata for you, and then Ahana’s Presto just plugs in on top and queries the data. And if you want more – to say, hey, I want to do some query federation, I want to extend, I want to grow beyond just data in S3 – Presto and Ahana make it that much easier to do.

So I think ease of use is kind of like the bow on top of this whole package makes it much easier for companies to consume.

Eric Kavanagh | Bloor Group

And it really is a new way of – –

Dipti Borkar | Ahana

A couple of things to add to that, Eric. You know, one of the things that Vinoth said is that we have a little bit more flexibility on the analytical side. And that’s actually important to understand. Because, you know, if I go back to this previous chart here, the original databases were built for OLTP workloads. So they were very rigid; these were business transactions, ACID compliance was extremely important. You needed, you know, multiple levels of isolation – again, very important.

But on the analytics side, you can give a little bit, and that’s where, because of this flexibility, we are now able to disaggregate the system a little bit more for these workloads. And over time that will change, right – that’s evolving as well. In the current state, we may not have all the ACID compatibility that you get with an OLTP system, but that’s okay. We didn’t even have the ability to insert, update and delete before. So now it’s at a point where not only is it simple to use, to Roy’s point – and I’ll talk about how managed services and the cloud have transformed that as well – but there’s also this flexibility: because we’ve, you know, given up a little bit on some of these hard constraints, we are able to run these workloads on this new stack, which wasn’t possible before. And so that is kind of an important aspect.

Now, ease of use is, you know, one of the reasons why I created Ahana; it was very hard to do SQL on S3. With Hadoop there were very many, many different components, many different projects. But when it’s all together, integrated into a managed service, and it just seamlessly fits in with other services like Glue, Lake Formation and others, it makes the life of a data platform engineer much easier, because you’re not managing the operations of the system on your own. And that’s really important. Because we saw that with Hadoop it took six months, nine months for projects to even complete. And even at that point, you didn’t really use the system fully; there was a lot of time that got spent on the operations of managing the system, and the Hive metastore, and all of these different aspects. But with this new world and managed services, that has simplified the lives of platform engineers.

So you no longer need to, you know, everything comes kind of built in into the system, Hudi is doing its part is doing the compactions as needed, you can obviously, you know, schedule those and there’s API’s for things, Presto is doing its part where you can do you have auto scaling, you have the ability for cost management, if the cluster is not doing anything, it can go into idle state, Glue is managing its thing. And so that is very important. Because at the end of the day, when you have a stack, you want to see value from it right away. You want to see value of it, where you get insights from the data, that’s the outcome. The outcome matters. And because of these different managed services, they plug in well together. And now a three person Data Platform team, we have a lot of customers that are running this stack, they have a two or three person data platform team, and they are able to run this, which was impossible to do two, three years ago.

So the cloud has transformed and managed services have helped with the adoption of this stack. And go from a highly complicated, many different components, where you have to have figured out the integrations, all of it fits in well together now.

Eric Kavanagh | Bloor Group

Yeah. Yeah, it’s very interesting, too, because it used to be that this one component was the constraint. And that was the choke point, basically throw this over to Vinoth Chandar of Hudi, and now we have much more versatility. But you’ve also to a certain extent, future proofed the architecture. Because I think one of the constraints of going with the old way, is that to do something new is very difficult and challenging. Ripping and replacing is always a very, very painful thing. Nobody wants to do that. So what we’ve kind of seen here is a whole separate architecture grow up around the existing systems. And now we’re mostly using this stuff for net new, but a lot of times you are using these new stacks for offload. So that traditional data warehousing workloads, now they really can work in this new environment that can ease some pressure on your data warehousing team.

I think the challenge is that from an organizational perspective, it’s not just technology, it’s people, its budgets, its human resources, it’s all these different things that amount to de facto constraints. And if you can, if you can expand the capability of the stack, while bringing down the number of people necessary, well, that really enables each individual to do a whole lot more.

What do you think about that Vinoth?

Vinoth Chandar | Hudi

Yeah, definitely. For example, in this model, going back to how we decoupled access to storage: when you want interactive query performance, you deploy something like Presto, which has long-running servers and can do a whole bunch of internal caching of data and metadata to speed up queries. But if you want to do, let’s say, your data science or machine learning workloads, a good chunk of those workloads are data preparation and kind of more complex feature extraction, like ETL jobs, right? You are now able to access raw S3 with very little overhead. Hudi, for example, can be that lightweight transaction layer which gives you the latest snapshot of a table. And you can actually scan the data at S3 speeds, right? Without getting bottlenecked by a lot of server tier in front of it, like you would if you were to, say, access data in Snowflake from another Spark cluster, right? So you have to pay for both, and then you’ll still be limited to the size of, let’s say, the warehouse cluster.

This really does make it a more general-purpose architecture that can support both analytics as well as the emerging data science and machine learning workloads. That said, I think we’ve solved it fairly well for structured data, and maybe semi-structured data, but there’s a lot of data beyond this, right – like when we look at computer vision and AI, and all of the growth that has happened there. I think very recently TensorFlow learned how to read Parquet files [inaudible]. So we’re still far away from doing this for a whole bunch of data that is not even tackled by warehouses at this point. And then this sets us up for a very nice future where you have open data. Then you have a lot of choice around what engines you want to pick, at what price point, and what capabilities the stack offers. It’s not just about performance in a lot of cases as well.

So yeah, that’s why I believe in this stack as kind of the future for how we do data in the industry.

Eric Kavanagh | Bloor Group

Yeah, that’s very cool. And Roy, I’m going to bring you into this. I’ll go down memory lane here, again, which I always love to do, just kind of reexamining my own learning curve. And how we got here. And I remember, 20 years ago, working with a number of enterprise software companies, a good friend of mine was running one. And he was talking about metadata and metadata, cataloging, and so forth. And I said to you, I’ve noticed that all these different companies have their own metadata repository. And it’s good that they have that. But wouldn’t it be better if you had a sort of unified metadata repository that different companies could access to? And that expedite reuse of data and mixing and matching all this? And he kind of laughed, He’s like, Well, yeah, I guess so. But that’s not going to happen. And I kind of think that sort of now finally, happening in part because of the cloud. Because we have that we have the scale.

We have the resources, of course, Amazon Web Services, hats off to those folks for getting a 10 year head start in the competition. I don’t know how that happened, but it happened [inaudible].

I’ll throw it over to Roy to kind of comment on that. Are we finally getting to a point where that metadata management component is so robust, and so versatile, that we can stop reinventing wheels to a certain extent?

Roy Hasson | AWS

Yeah, I mean, I think generally speaking, there are sort of two paths being taken by customers. The first one is really focused on data discovery and search, and the second one is around data access and security and governance. Those two right now are still fairly separate tracks. But on the data discovery, data search side, you’re seeing lots of open source tools like Amundsen and Nemo and, you know, you name them, DataHub, that are really focused on simplifying the discovery and the cataloging of data, and then making it easy for users to come in and search and discover and understand and annotate and collaborate on these kinds of things. But they’re not really solving the problem of data access and security.

We’ve had these tools around like Collibra and Alation; they’ve been around for a long time, and they’ve done a really good job at providing this type of catalog. But when a user goes to query the data – let’s say they go to Ahana and do a SELECT * FROM something – how do I satisfy that? I still need to have a Hive metastore or some metastore that can serve the access needs. And that’s where the Glue catalog really comes in and says: hey, we’re going to solve the discovery and the cataloging aspect of data, but we’re also going to give you the option to access the data from your choice of tool. Now, of course, we have lots of room to grow. It’s not perfect, you know; some of the tools out there have really awesome features and are doing a really good job.

But again, right now, what I see in the market, as these two paths are kind of running in parallel, eventually, they’re going to start merging together. But customers are definitely seeing the value because there is more data, right? There’s a lot more data, there’s a lot more ways of accessing the data. So the bottleneck becomes finding it, and understanding it, and understanding, should I use this? Is this fine and stable? Good? Is it have good accuracy of the data? Is it something that Roy randomly created, and he has no idea how to do math? You know, so that’s something that you got to make sure that you do, right. And that’s why these catalogs are becoming more and more of a central focus for our customers.

But the other aspect that I’ll add to this is security and governance. Once you’ve done that, you know, you may go into the catalog and say, well, Roy can only see these specific tables – how do you actually enforce it? You still have to enforce it when Roy runs a query in Ahana, or in Athena, or in Redshift: how can I consistently enforce those permissions without having to duplicate those policies in each of these systems? And that’s where the Glue catalog with Lake Formation really comes in and centralizes all of that together.

Dipti Borkar | Ahana

We haven’t talked about security at all. I mean, there is a big box that fits around what we were looking at – the stack – on security, because as soon as you have a Lake, where you’re seeing all your data in the lake now, and across the enterprise, it’s streaming data, it’s enterprise data, it’s IOT data, it’s third party data, it’s all your data, you absolutely need governance on top.

There are different approaches to that. And that goes to the operational catalog. So, like Roy was saying, the operational catalog – either the Hive metastore or Glue – really is a mapping between databases, and tables, and objects. Because with SQL you can’t query objects directly, you have to have some sort of a mapping, and that is kind of a foundational element. On top of that, there’s Apache Ranger, right, which has authorization – that’s coming up more and more – and there are obviously authentication mechanisms, like LDAP and, you know, other SSO, that need to be built on. And then with managed services like Lake Formation, they’re simplifying that by adding governance right on top of the storage layer, and taking those concerns down from the query engine at the top of the stack to the [inaudible].

We’re seeing some innovations and movement in the stack, actually, that makes it a lot more flexible; for longer term integrations with other engines, with multiple different types of data processing on top. That’s an area that, you know, we’ve just kind of scratched the surface. So far, in the next two, three years, there will be more innovation in the security governance space with data lakes.

Eric Kavanagh | Bloor Group

Now, that’s a really good point too. Let’s dive into the synthesis, because I was thinking to myself, Dipti, there is this great German concept of Gestalt, which basically means the whole is greater than the sum of its parts. And again, here you have groups working on specific components, but the magic happens when you bring them all together. Because what you’re doing is, you’re sort of resolving old problems with the new technology stack. But you do need to have that thorough vision from the top to the bottom.

Right, that’s kind of what you’re alluding to, I think, can you talk about the importance of, of appreciating the fullness of the stack and staying on top of the different components as the as they evolve?

Dipti Borkar | Ahana

I can go first. Absolutely. I mean, you know, it’s like one plus one is three in this case – you know, one plus one plus one plus one is 10 with the four components we’ve been talking about. At the end of the day, it’s the outcome that matters. If you put yourself in the shoes of a data platform team, or a data platform engineer, what we see is they’re looking for lower costs, good enough performance, security, and the ability to run all the workloads that they have been running, with the ability to run even more advanced workloads in the future. That’s what this stack essentially enables.

The query engine is getting more advanced as well. With Presto, we are doing more pushdowns, and we’re adding more of the traditional capabilities – databases have been around for 30-plus, 40 years; [inaudible] was written in 1970 – and there’s a lot of innovation from there that needs to be folded in. Just like the transaction piece is coming in with Hudi, the query engines themselves will start getting more advanced. There’s more that needs to be done there. As each of these layers becomes stronger and stronger and can handle more of these workloads – we talked about transactionality and ACID compliance, where we’ve kind of scratched the surface – you will start to see a much bigger move to data lakes.

Today, it might be an augment strategy, where maybe you use a Snowflake or Redshift for 20% of your workloads, for more reporting and dashboarding use cases. You use Presto for the interactive, ad hoc query analysis, some data science, some, you know, SQL-notebook kind of workloads, and you use Spark for deep transformation, ETL and others in the lake.

Over time, we’ll see a much bigger move as the stack evolves and becomes much stronger. That’s how I see this evolving. I think OLTP will stay the same – there will be some high-performing databases that will always be there. But from a warehouse perspective, the lake will consume a lot more workloads.

Eric Kavanagh | Bloor Group

Okay, good. And maybe, Vinoth, if you could explain in some more detail: What are the possible use cases for this transaction layer? What are some of the things that you’re now able to do in Hudi, that you used to be able to do in traditional database systems? Or still do quite frankly, traditional database systems? What are some of the examples of what this transaction layer can do for a company?

Vinoth Chandar | Hudi

Of course, right – like we mentioned, we saw previously that we were not able to do even single-table transactions when we write to the lake, like how we used to do with databases; that’s the basic thing that we started solving. But specifically in Hudi we focused on actually adding indexing capabilities, and we have a file layout which is very conducive to fast inserts and deletes. When we designed Hudi, we wanted to make it almost OLTP-ish performance for update and delete workloads. It’s still, like, batched; it’s not the same low latency as your regular operational database. But compared to where even [inaudible] and the data lake were, I think that’s one part that organizations can benefit from.

Back in 2016, when we started the project, we solved mutability and transactionality as a means to an end for solving incremental processing. Our biggest problem at Uber back then was that we had all these big batch jobs and we needed to incrementalize them. So right now, if I look back, I think we’ve solved incremental data ingestion pretty well. We can deploy something and it’s self-managing; it works out of the box. The second part is, just like databases bring CDC, we bring a lot of CDC capabilities to data lakes – that is something that Hudi uniquely brings to the table. Right now in the industry, record-level change streams are something you can only consume from RDBMSs. What this opens up is a reimagination of data processing using an optimized storage layer like Hudi. Now you can do stream-table joins in the data lake.

Then these frameworks like Flink and Spark and Beam are evolving to generalize batch processing in terms of the stream processing APIs, if you will. So that will be a very interesting next few years as we go towards that. I believe a lot of batch processing today is still kind of reprocessing a lot of data. Typically the way we do batch processing is, you know, take the last N partitions of the data and run over them. So we literally added broad batch processing operations to Hudi late last year. We believed in this incremental vision. But yeah, I think these things are coming together now, where organizations will be able to drop a lot of their data processing compute spend by adopting a more incremental model.

And this is all made possible not just by the transactional capability, but by the fact that we designed for fast updates and deletes – we can absorb delta changes quickly – and can also hand out a best-in-class CDC change log for other downstream data processing.
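
As a concrete illustration of the incremental-pull pattern described above, here is a minimal PySpark sketch of reading only the records that changed after a given commit from a Hudi table. The S3 path and begin instant are hypothetical, and the option keys follow the Apache Hudi Spark datasource documentation, so verify them against your Hudi and Spark versions.

# Hedged sketch: incremental read from an existing Hudi table with PySpark.
# Assumes the matching hudi-spark bundle jar is available to the Spark session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-incremental-read")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "s3://my-bucket/warehouse/trips"   # hypothetical Hudi table location
begin_instant = "20220101000000"               # commit time to read changes after

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_instant)
    .load(base_path)
)

# Only records committed after begin_instant come back, so a downstream job can
# process the delta instead of reprocessing the last N partitions wholesale.
incremental_df.createOrReplaceTempView("trips_changes")
spark.sql("SELECT count(*) AS changed_rows FROM trips_changes").show()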

Eric Kavanagh | Bloor Group

That’s fascinating. It’s really cool stuff. I’ll turn this one over to Roy; we have a couple of good questions from the audience, and a couple of them are asking about Delta Lake too. So I’ll just read one to you, Roy, and then maybe Dipti, after Roy comments, you can comment on this too.

But speaking of open source projects, I can’t help but compare Apache Hudi versus Delta Lake. I assume AWS is leaning more towards Hudi since they made it available as a connector in the AWS Marketplace.

Can you talk about that from your perspective.

Roy Hasson | AWS

I’ll let Vinoth talk about kind of the differences between them from a technical perspective, but from a positioning perspective, our intention is to support all the formats that are popular with our customers. You know, Hudi works great with EMR – we made certain choices around including Hudi with Amazon EMR, and we support it in Glue. Delta Lake is also supported; it works fine with EMR, and you can make it work with Glue as well.

We continue to work with both the Hudi teams and the Delta Lake and also the Iceberg team to start building more integration for these formats into our services. Because customers are asking for them. So there’s no one good answer. And you know, of course we have our own that we announced – Lake Formation Governed Tables – to try to solve some of the complexities in some of the issues that exist today in the current formats.

It just gives customers options, but I’ll let Vinoth talk about the main differences.

Eric Kavanagh | Bloor Group

Yeah, if you would Vinoth, go ahead.

Vinoth Chandar | Hudi

Yeah. So, Delta Lake versus Hudi – technically, I’d like to keep it short; there are a lot of technical differences. For example, Delta Lake supports what in Hudi we call copy-on-write storage, where there’s higher write amplification but you pretty much work in Parquet files. Whereas in Hudi you also get the more flexible merge-on-read kind of format, which lets you actually absorb updates and deletes as they come and asynchronously compact them later. And then the transactional model for something like Delta Lake is strictly optimistic concurrency control, which in my humble opinion is not a great choice when you have long-running transactions in the kind of data lake ecosystem.

And so Hudi was designed with a more MVCC-based concurrency control, where, just like a database, we differentiate between actual external writers to the table and internal processes which are managing compaction and clustering. Just like how a database would coordinate between a cache manager – like the buffer pool manager – and a locking thread, Hudi has that runtime around it.

As a project, we have significant functionality that you get for free in open source, right, whereas a good chunk of this functionality in Delta Lake is locked into the Databricks runtime. So with Hudi, you get all this for free. You can run it on any cloud you want, on any Spark cluster that you want – even on Databricks, you can run all of this. That’s how I would say Hudi, at this point, is a more complete platform, if you will. I also wanted to take this point to talk a little bit about table formats. We literally put up a blog today clarifying what the project stands for. In the last few years, there has been a lot of activity around table formats. In my opinion, at least, just building another format is a great step: it gets rid of a lot of bottlenecks in data access layers, like file listing, which can slow down metadata access. But honestly, if these were solved at the metastore layer, formats wouldn’t have to solve this within them.

So formats have their place, but I feel they’re still a means to an end in the grand scheme of things. If we were to have the same level of usability and reliability and ease of use for data lake users that they have with the [inaudible], we need a more well-integrated stack on top. And that’s what Hudi is building. For example, we’ve had a ticket to plug in Iceberg as an option under the Hudi runtime, if you will, for a while now. We are open to even working on top of other formats. Of course, users have to give up some benefits – like, Iceberg only supports optimistic concurrency control, so your compaction will lock and will fail your ingestion, and these things may happen. So it’s fine as long as you’re okay with the trade-offs.

I think over time, I feel like we should have more standard API’s across these formats and actually also build more layers on top in a cohesive way. That’s where I would push on the projects and where they are pretty different in terms of where they are going.

Eric Kavanagh | Bloor Group

That’s fascinating stuff. I mean, really, it’s very, very interesting. Dipti, what, what advice do you have for folks to stay on top of what’s happening? Because I’ve been tracking open source now on and off for 15 years. And in the last three to four years, it’s borderline bewildering how much innovation is happening in different camps.

But in fact, one of the attendees asked an interesting question, that you can maybe riff off of, but the attendee is saying – what about all the other cloud environments where, you know, Google and Microsoft, are we going to see new walled gardens? Is this the new sort of Age of Empire, or instead of the old IBM, Oracle, SAP now we have Google, Microsoft, Amazon? How much of that analogy holds from your perspective?

Dipti Borkar | Ahana

Yeah, it’s a good question, right? There were the three big database companies, and now that’s changing. You still have data warehouses in each of these different clouds. But with these open source stacks – like Presto, which can plug into Hudi, and we plan to add the other table transaction managers as well, like Delta, etc. – there’s Hudi, which plugs into multiple engines on the top and multiple formats below it. The beauty of every layer, every component in the stack, is that it fits in with multiple components on the top, if there are components on the top, and multiple components below. And that’s what you get when you take foundation-oriented open source projects.

Apache will always be open – with the Apache 2.0 license, you’re never going to get locked in. Presto is under the Linux Foundation; it’s always going to be open. It’s a community-driven project. Users have to think through what is important for them. Are they looking at just one solution that does everything, and may not do everything well, but does everything? That’s kind of the traditional platform approach.

Databricks is trying to go with that approach, where it’s trying to solve everything in this data lake space. Or you could pick the best engine and the best stack for the four broad workloads that you’re trying to solve. So for example, for interactive querying, for reporting and dashboarding, and for federation on a lake, a Presto-Hudi stack makes more sense. Transformation? Spark might make more sense. For machine learning, TensorFlow might make more sense. So I would advise platform teams to think through: what is important for them? Performance characteristics? The open formats that they want to support in the storage layer? And, of course, which cloud? Because even though all of these are multi-cloud technologies – you can run them on every cloud – most teams have a primary cloud, and then they may have some secondary clouds. And so figure out, for the cloud that you’re on, what is the best option? What is the best stack, right?

And this – we’re mostly focused on AWS today – is a great stack for AWS. For Google, it might be something else; they have Dataproc, you know, and these layers feed into Dataproc as well – Presto can run on Dataproc as well. And so those are some of the things that I’d bring up, that platform teams should think through, Eric.

Eric Kavanagh | Bloor Group

Okay, good. We have another good audience question I’ll throw over to Vinoth. One attendee is asking, Does Hudi fit in with the Cloudera stack? And how would you say it compares or complements Apache Kudu?

Vinoth Chandar | Hudi

I’m not super familiar with all of the Cloudera stack, but at a high level? Yeah, it can run on top of HDFS, and you can query it from Impala even today. All our jobs will run on YARN. So I think it’s compatible with the Cloudera stack. Now, the Kudu question is very interesting. In fact, Kudu was something that we were evaluating at that time, before we wrote Hudi. The thing with Kudu is, it still feels like a specialized data store. I would lump Kudu more in terms of: it needs SSD-optimized storage, and it kind of gives you upserts that are, like, closer to real time. I haven’t tested it fully.

But my understanding from the paper is that you have SSD storage, so they can optimize for both updates as well as doing scans better, and stuff like that. So it feels like a specialized storage engine for analytics, to be evaluated together with, let’s say, Druid and the other real-time analytics engines of the world, to see how it fits together. Hudi is designed more as a general-purpose transactional storage layer for you, where you can manage all of your data, right, forever. And this was one of the reasons why we decided to write Hudi: because, at least for Uber and the volume that it had, I couldn’t see how I would future-proof this. Meaning, what if we were to move to cloud object storage in the future? Uber has 250 petabytes stored like this today, or something like that.

I don’t want to run a Kudu cluster that big, or really any other cluster that big. So again, going back, this decoupled layer is awesome because you can have a team that is just scaling and managing the data plane, and then there are separate teams who can bring in Presto and work on Presto. This whole model was, I think, much more scalable – I thought so back in the day, and honestly, when I was trying to do it, I was alone. But over time, I’ve seen Delta Lake do a similar thing, which is: hey, we’re going to decouple a transaction layer written in Spark. That’s kind of like what we had already done before. So yeah, I think this model scales a lot more for a general-purpose data layer, I would say.

Eric Kavanagh | Bloor Group

That’s great. Well, folks, we burned through an hour here, I want to throw one last question at each of you to kind of summarize and be a bit forward looking perhaps, but Roy, I’ll start with you. I had this realization, a few months ago, I guess I’ve been thinking about it for a while, that we really are entering, we’re now in a new generation of enterprise computing. And I described as four generations basically,

First is the mainframe. Second would be client server. Third is original ASP generation application service providers, where you know, Salesforce was kind of the big poster child in that. And now what we’re seeing is this fourth generation, where you don’t want to go build some stack yourself to do all these different component parts, you want to let companies like Amazon, quite frankly, and even Uber and Google and these other major vendors, because they open source this stuff, let them build those services. And then you leverage those services.

So you build a sort of vision or a framework on top of these existing services that Amazon can provide, and others. And that’s a whole new way of doing things. Well, what do you think about that assessment? Roy?

Roy Hasson | AWS

Yeah, no, I think that I think that’s generally true. You know, AWS is building these building blocks. So folks like Dipti and Vinoth can build and innovate on top of that. We’re talking to our customers all the time, we’re trying to find the simplest and the best way to solve their problems. And sometimes we can do it with our native services. Sometimes we just create these building blocks. And then partners like Ahana come in and puts it all together. Puts a nice bow around it and say this is this is the best way to do it. So that’s, that’s a great ecosystem.

And I think, as we continue down this path of analytics, it really boils down to exposing and integrating – exposing those APIs and integrating. Building these vertical solutions that claim to do everything is going back 20-30 years to the same problems we had with Oracle. Do we really want to do it now, just in a shinier package? Not picking on anyone in particular, I’m just saying we tend to over-bias on the ease of use and what the business needs to make it easier, and we tend to forget, when we’re building these solutions, that they’re not there for a year – they’re there for years, 10-15 years. So we have to make sure that we’re also looking at the architecture, the implementation, the decoupling, because the scale problem will come back. Maybe we just kind of pushed it down the road a little bit. But the scale problem will come back, and if you don’t have the right levers, if you don’t have the right technology to solve those problems, you’re back to square one just a few years down the road.

So again, with Hudi, I think it’s a great way to store the data. It’s a great way to simplify managing the data. But it also gives you portability. If tomorrow you said, hey, I want to go in a different direction, or I want to just move away from Hudi to something else, it is an open format – you can just take your data and do whatever you want with it. It doesn’t lock you in. The same thing with the Glue catalog. You know, if you want to use our service, fantastic.

If you don’t want to use it, you think something else is better, the metadata is there. There’s API’s. And I’ll say the same thing about Ahana and Athena, and all these other services. We’re using SQL, if you if you decide that you want to take your SQL somewhere else, because there’s a better, faster, cheaper engine, go ahead and do it. So this decoupling makes a lot of sense. And it’s going to save customers a ton of money and effort in the long run.

Eric Kavanagh | Bloor Group

That’s a great point. Dipti closing thoughts from you.

Dipti Borkar | Ahana

Yeah, completely agree. Obviously, we’ve built a managed service on top of AWS, which is an incredible cloud; it gives you really all of those building blocks. For the control plane, we’ve used so many different aspects: Presto runs on Kubernetes, we’re using EKS, and we have serverless Lambdas that are part of the control plane, so that it’s highly scalable. These building blocks allow vendors like Ahana to build newly designed, new-world control planes that take out some of this operational complexity and make it very easy to use, while giving users the flexibility of extending and scaling at these different layers. So it’s three or four different components, really – that’s what it comes down to. And so it’s fairly manageable.

Given each layer is a managed service, it’s up and running in 30 minutes. In 30 minutes, you can do SQL on S3, integrating this entire stack. That is the new world. And whether it’s AWS or other clouds, the ease of use and the flexibility of the openness is what it’s about. The way I define open data lakes is: open formats, ORC or Parquet, very important. You can move an engine, move a transaction manager whenever you need to; you’re not locked in, as opposed to ingesting your data into some other data warehouse in another place.

Open interfaces: SQL is the lingua franca and there are a lot of tools out there, and it is open, open source – you’re not getting locked into a proprietary engine in these layers at the top. Storage is S3, commoditized for 15 years now, and that’s taken care of. And then open clouds: we’ve talked about AWS largely today, but you know, it is multi-cloud and the stack can be run in different clouds as well.

Eric Kavanagh | Bloor Group

I love it, I want to give Vinoth one chance to give his closing thoughts as well. What does the future hold for you? And for Hudi?

Vinoth Chandar | Hudi

Yeah, so I think we will continue to build this out, to make this data plane – an open data plane – as easy to build as possible. Then, reflecting on some of what Roy mentioned, I want to be a little bit backward-looking: if you look at a lot of the technologies that we use today, Hive or Parquet or some of these things, these were actually born in the Hadoop era, if you will. But that’s the beauty of this model: over time, these things are open, and when vendors don’t do a great job, others can step up. People can operate managed EMR, or Presto as a service. Then we have this opportunity to keep building towards this vision over a long period of time. Whereas if we keep going down the path where the closed walled garden is the way of the future, then we know what happens, right? Big companies saturate after a point, innovation slows down, and then people move out.

So all of this will inevitably happen, and what will end up happening is that innovation across the board will be affected by that. So our goal here is to keep making this data plane better and better and better, so that you can painlessly get started with bringing your data in and then use anything that you want. That’s the principle, and I think we have our work cut out for the next one or two years in terms of matching all the walled gardens on usability. I say this while realizing you can’t win on just openness, but that’s why we emphasize the well-integrated stack inside the Hudi project so much, and we’ll continue to do so.

Eric Kavanagh | Bloor Group

I love it. Well folks, look all these guys up online. This is the new stack. It's another way of looking at how to build out your future. It's always changing, but I think these folks understand the critical importance of interoperability and componentization, if you will. So we do archive all these webinars. Thanks for joining us. Thanks, Dipti, for the invite. Thank you, gentlemen, for joining us today. Great questions from the audience. I'm sure we're going to have much more to talk about over the next few years. Thank you.

Dipti Borkar | Ahana

Great, thanks, everyone. Bye.

Speakers

Vinoth Chandar
Creator of Hudi

Roy Hasson
Principal Product Manager

Dipti Borkar
Cofounder & CPO

Eric Kavanagh
Moderator

Tutorial: How to run SQL queries with Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on Google BigQuery.

Presto's BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

Step 1: Setup a Presto cluster with Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions, or use Ahana's managed service for Presto.

Step 2: Setup a Google BigQuery Project with Google Cloud Platform

Create a Google BigQuery project from the Google Cloud Console and make sure it's up and running with a dataset and tables, as described here.

The screen below shows a Google BigQuery project with the table "Flights".

Step 3: Set up a key and download Google BigQuery credential JSON file.

To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format. 

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.
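For reference, the key can also be created from the command line with the standard gcloud CLI; the sketch below is illustrative and assumes a hypothetical service account named bigquery@<your-project-id>.iam.gserviceaccount.com:

# Create a JSON key for the service account (the account name is hypothetical)
gcloud iam service-accounts keys create bigquery-credentials.json \
  --iam-account=bigquery@<your-project-id>.iam.gserviceaccount.com

# Point GOOGLE_APPLICATION_CREDENTIALS at the downloaded file
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/bigquery-credentials.json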

Sample credential file should look like this:

{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x505/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}

Pro-Tip: Before you move to the next step, try to use your downloaded credential JSON file with a third-party SQL tool like DBeaver to access your BigQuery table. This verifies that your credentials have valid access rights and helps isolate any issues with your credentials.

Step 4: Configure Presto Catalog for Google BigQuery Connector

To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. You can create the file with the following contents, replacing the connection properties as appropriate for your setup. This should be done by editing the catalog config map so that it's reflected in the deployment:

kubectl edit configmap presto-catalog -n <cluster_name> -o yaml

Following are the catalog properties that need to be added:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=path/to/bigquery-credentials.json

Following are the sample entries for catalog yaml file:

bigquery.properties: |
  connector.name=bigquery
  bigquery.project-id=poised-journey-317806
  bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json

Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file

To configure the BigQuery connector,

  1. Load the content of the credential file as bigquery-credential.json in the Presto coordinator's configmap:

kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml

  2. Add a new volumeMounts section for the credential file in the coordinator's deployment file:

    kubectl edit deployment presto-coordinator -n <cluster_name> 

The following is a sample configuration that you can append at the end of the volumeMounts section in your coordinator's deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-coordinator-etc-vol
  subPath: bigquery-credential.json
  3. Load the content of the credential file as bigquery-credential.json in the Presto worker's configmap:

kubectl edit configmap presto-worker-etc -n <cluster_name>  -o yaml

  4. Add a new volumeMounts section for the credential file in the worker's deployment file:

kubectl edit deployment presto-worker -n <cluster_name> 

The following is a sample configuration that you can append at the end of the volumeMounts section in your worker's deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-worker-etc-vol
  subPath: bigquery-credential.json

Step 6: Setup database connection with Apache Superset

Create your database connection URL to query from Superset using the syntax below:

presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>

Step 7: Check for available datasets, schemas and tables, etc

After successfully connecting the database with Superset, run the following queries to make sure the bigquery catalog gets picked up, and perform show schemas and show tables to understand the available data.

show catalogs;

show schemas from bigquery;

show tables from bigquery.rohan88;

Step 8: Run SQL query from Apache Superset to access BigQuery table

Once you access your database schema, you can run SQL queries against the tables as shown below. 

select * from catalog.schema.table;

select * from bigquery.rohan88.flights LIMIT 1;

You can perform similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI:

$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password

The following example shows how you can join a Google BigQuery table with a Hive table from S3 and run SQL queries:
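As a sketch (the Hive table and join column below are hypothetical), such a federated query could look like:

select f.origin, a.airport_name, count(*) as num_flights
from bigquery.rohan88.flights f
join hive.default.airports a on f.origin = a.airport_code
group by f.origin, a.airport_name;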

At Ahana, we have made it very simple and user friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Snowflake may not be the silver bullet you wanted for your long term data strategy… here’s why

Since COVID, every business has pivoted and moved everything online, accelerating digital transformation with data and AI. Self-service, accelerated analytics has become more and more critical for businesses, and Snowflake did a great job bringing cloud data warehouses to the market when users were struggling with on-prem big data solutions and trying to catch up on their cloud journey. Snowflake is designed foundationally to take advantage of the cloud's benefits, and while Snowflake has benefited from a first-mover advantage, here are the key areas you should think about as you evaluate a cloud data warehouse like Snowflake.

Open Source and Vendor Lock-in

Using a SQL engine that is open source is strategically important because it allows the data to be queried without the need to ingest it into a proprietary system. Snowflake is not open source software. Only data that has been aggregated and moved into Snowflake in a proprietary format is available to its users. Moreover, Snowflake is pushing back on open source in favor of its proprietary solutions. Recently, Snowflake announced the Snowflake Data Cloud, positioning Snowflake as a platform for "cloud data" where organizations can move and store all of their data.

However, surrendering all your data to the Snowflake data cloud model creates vendor lock-in challenges: 

  1. Excessive cost as you grow your data warehouse
  2. If ingested into another system, data is typically locked into formats of the closed source system
  3. No community innovations or way to leverage other innovative technologies and services to process that same data

Snowflake doesn’t benefit from community innovation that true open source projects benefit from. For example, an open source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Ahana and more. At Twitter, engineers are working on the Presto-Iceberg connector, aiming to bring high-performance data analytics on open table format to the Presto ecosystem. 

Check out this short session for an overview of how Presto is evolving to be the next-generation query engine at Facebook and beyond.

With a proprietary technology like Snowflake, you miss out on community-led contributions that can shape a technology for the best of everyone. 

Open Format

Snowflake has chosen to use a micro-partition file format that might be good for performance but is closed source. The Snowflake engine cannot work directly with most common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into an internal Snowflake file format, but you then miss out on the performance optimizations that these open formats can bring to your engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge files.

On the other hand, Presto users can use TensorFlow on the same open formats, like Parquet and ORC, so there's a lot of flexibility that you get with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.

While migrating from legacy data warehouse platforms to Snowflake may offer less friction for cloud adoption, trying to integrate open source formats into a single proprietary solution may not be as simple as sold.

Check out this session on how you can leverage the Apache Parquet, Apache Hudi, and PrestoDB integration to build an open data lake.

Federated queries

A SQL engine is needed both for the data lake where raw data resides and for the broad range of other data sources, so that an organization can mix and match data from any source. If your data resides in relational databases, NoSQL databases, cloud storage, file systems like HDFS, etc., then Snowflake is not suitable for your self-service data lake strategy. You cannot run SQL queries across data stored in relational, non-relational, object, and custom data sources using Snowflake.

Workload Flexibility

Today users want to create new applications at the same rate as their data is growing and a single database is not a solution to support a broad range of analytical use cases. One common workload is training and using machine learning models right over warehouse tables or streaming analytics. Snowflake focuses on a traditional data warehouse as a managed service on the cloud and requires proprietary connectors to address these ML/DS workloads, which brings up data lineage challenges.

If you have a lot of unstructured data like text or images, the volume is beyond petabytes, or schema-on-read is a must-have feature, then Snowflake does not fit into your data lake architecture. 

The new generation of open platforms that unifies the data warehouse and advanced analytics is something that Snowflake is not fundamentally designed for; Snowflake is only suitable for data warehouse use cases.

Data Ownership

Snowflake did decouple storage and compute. However, Snowflake does not decouple data ownership. It still owns the compute layer as well as the storage layer. This means users must ingest data into Snowflake using a proprietary format, creating yet another copy of data and also requiring users to move their data out of their own environment. Users lose ownership of their data.

Cost

Users think of Snowflake as an easy and low-cost model. However, it gets very expensive and cost-prohibitive to ingest data into Snowflake. Very large data volumes and enterprise-grade, long-running queries can result in significant Snowflake costs.

As Snowflake is not fully decoupled, data is copied and stored into Snowflake's managed cloud storage layer within Snowflake's account. Hence, users end up paying a higher cost to Snowflake than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price under a proprietary tag.

Conclusion

Snowflake may sound appealing in how simple it is to implement a data warehouse in the cloud. However, an open data lake analytics strategy will augment the data warehouse in the places where the warehouse may fall short, as discussed above, providing significant long-term strategic benefits to users.

With PrestoDB as a SQL engine for open data lake analytics, you can execute SQL queries at high performance, similar to the EDW. Using Presto for your data lake analytics means you don't have to worry about vendor lock-in, and it gets you the benefits of open source goodness like RaptorX, Project Aria, Apache Ranger integration, etc. Check out this short tutorial on how to query a data lake with Presto.

While powerful, Presto can be complex and resource-intensive when it comes to managing and deploying. That’s where Ahana comes in. Ahana Cloud is the easiest managed service for PrestoDB in the cloud. We simplify open source operational challenges and support top innovations in the Presto community. 

As Open Data Lake Analytics evolves, we have a great and advanced roadmap ahead. You can get started with Ahana Cloud today.

Can I write back or update data in my Hadoop / Apache Hive cluster through Presto?

Using Presto with a Hadoop cluster for SQL analytics is pretty common, especially in on-premises deployments.

With Presto, you can read and query data from the Hadoop datanodes but you can also make changes to data in Hadoop HDFS. There are however some restrictions. 

All this is enabled via Presto’s Hive Connector. 

The first step is to create a catalog properties file and point to the Hive Metastore. 

You can also optionally configure some Hive Metastore properties for the Hive Connector. 

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://example.net:9083

For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. 

Creating a new table in HDFS via Hive

Using Presto you can create new tables via the Hive Metastore. 

Examples

Create table: 

Create a new Hive table named page_views in the web schema that is stored using the ORC file format, partitioned by date and country. hive is the name of the catalog, which comes from the name of the catalog properties file (hive.properties).

CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country']
)
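The Hive connector also supports adding new rows with INSERT. As a minimal sketch against the table created above (the values are purely illustrative):

INSERT INTO hive.web.page_views
VALUES (TIMESTAMP '2016-08-09 10:00:00', 42, 'https://example.com/home', DATE '2016-08-09', 'US')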

Deleting data from Hive / Hadoop

With the Hive connector, you can delete data but this has to be at the granularity of entire partitions. 

Example: 

Drop a partition from the page_views table:

DELETE FROM hive.web.page_views

WHERE ds = DATE '2016-08-09'

  AND country = 'US'

Drop the external table request_logs. This only drops the metadata for the table. The referenced data directory is not deleted:

DROP TABLE hive.web.request_logs

Drop a schema:

DROP SCHEMA hive.web

Hive Connector Limitations

  • DELETE is only supported if the WHERE clause matches entire partitions.
  • UPDATE is not supported from Presto

How do I convert Unix Epoch time to a date or something more human readable with SQL?

Unix epoch time is often stored in databases, but it is not very human readable, so conversion is required for reports and dashboards.

Example of Unix Epoch Time: 

1529853245

Presto provides many date time functions to help with conversion. 

In case of a Unix Epoch Time, the from_unixtime function can be used to convert the Epoch time. 

This function returns a timestamp. 

from_unixtime(unixtime) → timestamp
Returns the UNIX timestamp unixtime as a timestamp.


After converting the Unix Epoch time to a timestamp, you can cast it into other formats as needed such as extracting just the date. Examples follow below. 


Examples: 

Query

select from_unixtime(1529853245) as timestamp;

Result

timestamp

2018-06-24 15:14:05.000

Query

select cast(from_unixtime(1529853245) as date) as date;

Result

date

2018-06-24
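If a specific string format is needed rather than a date, the converted timestamp can also be passed to date_format, which uses MySQL-style format specifiers. For example:

Query

select date_format(from_unixtime(1529853245), '%Y-%m-%d %H:%i') as formatted;

Result

formatted

2018-06-24 15:14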
More examples and information can be found here: https://ahana.io/answers/how-to-convert-date-string-to-date-format-in-presto/

How do I transfer data from a Hadoop / Hive cluster to a Presto cluster?

Hadoop is a system that manages both compute and data together. Hadoop cluster nodes have the HDFS file system and may also have different types of engines like Apache Hive, Impala or others running on the same or different nodes. 

In comparison, Presto, an open source SQL engine built for data lakes, is only a query engine. This means that it does not manage its own data. It can query data sitting in other places like HDFS or in cloud data lakes like AWS S3. 

Because of this, there is no data transfer or ingestion required into Presto for data that is already residing in an HDFS cluster. Presto's Hive Connector was specifically designed to access data in HDFS and query it in Presto. The Hive connector needs to be configured with the right set of config properties.

The Presto Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://example.net:9083

For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. In some cases, such as when using federated HDFS or NameNode high availability, it is necessary to specify additional HDFS client options in order to access your HDFS cluster. To do so, add the hive.config.resources property to reference your HDFS config files:

hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
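Once the catalog is configured, the HDFS data can be queried in place through the hive catalog. A minimal sketch (the schema and table names below are hypothetical):

SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.web;
SELECT count(*) FROM hive.web.page_views;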

Hands-on Presto Tutorial: How to run Presto on Kubernetes

ahana logo

What is Presto?

Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.

Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, and object stores, as well as search and indexing systems such as Elasticsearch and Druid.

The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.

How is Presto typically deployed?

Presto deployments can be found in various flavors today. These include:

  1. Presto on Hadoop: This involves Presto running as a part of a Hadoop cluster, either as a part of open source or commercial Hadoop deployments (e.g. Cloudera) or as a part of Managed Hadoop (e.g. EMR, DataProc) 
  2. DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
  3. Serverless Presto (Athena): AWS’ Serverless Presto Service
  4. Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)

Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.

All the scripts, configuration files, etc. can be found in these public github repositories:

https://github.com/asifkazi/presto-on-docker

https://github.com/asifkazi/presto-on-kubernetes

You will need to clone the repositories locally to use the configuration files.

git clone <repository url>

What is Kubernetes (K8s)?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery. 

In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.

Why Presto on Kubernetes?

Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes’ ability to ease operational management of the application significantly simplifies the Presto deployment – resiliency, configuration management, ease of scaling in-and-out come out of the box with K8s.

A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.

Getting Started – What do I need?

Local Docker Setup

To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single node deployment running locally on your machine. This will get you familiarized with the basic configuration parameters of the Docker container and make it way easier to troubleshoot.

Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.

Kubernetes / EKS Cluster

To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes e.g. reducing the resource requirements on the containers.

If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.

You also need to have a local cloned copy of the github repository https://github.com/asifkazi/presto-on-kubernetes

Nodegroups with adequate capacity

Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.

After you have your EKS cluster created (in my case it's 'presto-cluster'), you should go in and add a node group which has sufficient capacity for the Presto Docker containers to run on. I plan on using r5.2xlarge nodes. I set up a node group of 4 nodes (you can tweak your Presto Docker container settings accordingly and use smaller nodes if required).

Figure 1: Creating a new nodegroup

Figure 2: Setting the instance type and node count

Once your node group shows active you are ready to move onto the next step

Figure 3: Make sure your node group is successfully created and is active

Tinkering with the Docker containers locally

Let's first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Dockerfile, the scripts, and the environment variables supported, the repository can be found here.

The details of the specific configuration parameters used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by providing values via the --env option for docker, or by using name-value pairs in the Kubernetes yaml file as we will see later.

You need the following:

  1. A user and their Access Key and Secret Access Key for Glue and S3 (You can use the same or different user): 

 arn:aws:iam::<your account id>:user/<your user>

  2. A role which the user above can assume to access Glue and S3:

arn:aws:iam::<your account id>:role/<your role>

Figure 4: Assume role privileges

Figure 5: Trust relationships


  3. Access to the latest docker image for this tutorial: asifkazi/presto-on-docker:latest

Warning: The permissions provided above are pretty lax, giving the user a lot of privileges, not just for assuming the role but also in terms of what operations the user can perform on S3 and Glue. DO NOT use these permissions as-is for production use. It's highly recommended to tighten the privileges using the principle of least privilege (only provide the minimal access required).

Run the following commands:

  1. Create a network for the nodes

docker network create presto

  2. Start a MySQL docker instance

docker run --name mysql -e MYSQL_ROOT_PASSWORD='P@ssw0rd$$' -e MYSQL_DATABASE=demodb -e MYSQL_USER=dbuser -e MYSQL_PASSWORD=dbuser -p 3306:3306 -p 33060:33060 -d --network=presto mysql:5.7

  3. Start the Presto single-node cluster on docker

docker run -d --name presto \
  --env PRESTO_CATALOG_HIVE_S3_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \
  --env PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY="<Your Access Key>" \
  --env PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY="<Your Secret Access Key>" \
  --env PRESTO_CATALOG_HIVE_GLUE_AWS_ACCESS_KEY="<Your Glue Access Key>" \
  --env PRESTO_CATALOG_HIVE_GLUE_AWS_SECRET_KEY="<Your Glue Secret Access Key>" \
  --env PRESTO_CATALOG_HIVE_METASTORE_GLUE_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \
  -p 8080:8080 \
  --network=presto \
  asifkazi/presto-on-docker:latest

  4. Make sure the containers came up correctly:

docker ps 

  5. Interactively log into the docker container:

docker exec -it presto bash

  6. From within the docker container, run the following command to verify that everything is working correctly:

presto

  7. From within the Presto CLI, run the following:

show schemas from mysql

The command should show the mysql databases

  8. From within the Presto CLI, run the following:

show schemas from hive

The command should show the databases from glue. If you are using glue for the first time you might only see the information_schema and default database.

We have validated that the docker container itself is working fine as a single node cluster (worker and coordinator on the same node). We will now move to getting this environment working in Kubernetes. But first, let's clean up.

Run the following command to stop and cleanup your docker instances locally.

docker stop mysql presto;docker rm mysql presto;

Getting Presto running on K8s

To get Presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to Kubernetes-specific properties, we will provide all the Docker env properties via name-value pairs.
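As a sketch of what these name-value pairs look like (the snippet below assumes a standard Kubernetes Deployment container spec and reuses the image and environment variable names from the Docker steps above; the values are placeholders):

containers:
- name: presto-coordinator
  image: asifkazi/presto-on-docker:latest
  env:
  - name: PRESTO_CATALOG_HIVE_S3_IAM_ROLE
    value: "arn:aws:iam::<Your Account>:role/<Your Role>"
  - name: PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY
    value: "<Your Access Key>"
  - name: PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY
    value: "<Your Secret Access Key>"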

  1. Create a namespace for the presto cluster

kubectl create namespace presto

  2. Override the env settings in the presto.yaml file for both the coordinator and worker sections
  3. Apply the yaml file to the Kubernetes cluster

kubectl apply -f presto.yaml --namespace presto

  4. Let's also start a MySQL instance. We will first create a persistent volume and claim:

kubectl apply -f ./mysql-pv.yaml --namespace presto

  5. Create the actual instance:

kubectl apply -f ./mysql-deployment.yaml --namespace presto

  6. Check the status of the cluster and make sure there are no errored or failing pods:

kubectl get pods -n presto

  7. Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You will need the pod name for the coordinator from the command above.

kubectl exec -it  <pod name> -n presto  -- bash

kubectl exec -it presto-coordinator-5294d -n presto  -- bash

Note: the space between the -- and bash is required

  8. Querying seems to be working, but is the Kubernetes deployment a multi-node cluster? Let's check:

select node,vmname,vmversion from jmx.current."java.lang:type=runtime";

  9. Let's see what happens if we destroy one of the pods (simulating failure):

kubectl delete pod presto-worker-k9xw8 -n presto

  10. What does the current deployment look like?

What? The pod was replaced by a new one presto-worker-tnbsb!

  11. Now we'll modify the number of replicas for the workers in presto.yaml
  12. Set replicas to 4, as in the sketch below
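A sketch of the relevant part of the worker Deployment spec in presto.yaml (standard Kubernetes Deployment fields):

spec:
  replicas: 4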

Apply the changes to the cluster

kubectl apply -f presto.yaml --namespace presto

Check the number of running pods for the workers

kubectl get pods -n presto

Wow, we have a fully functional presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing the availability and resiliency. 

Summary

In this tutorial we setup a single node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of the Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs as well as demonstrate resiliency by forcefully killing off a pod.

Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.

Ready for your next Presto lesson from Ahana? Check out our guide to running Presto with AWS Glue as catalog on your laptop.

Presto 102 Tutorial: Install PrestoDB on a Laptop or PC

Summary

Prestodb is an open source distributed parallel query SQL engine. In tutorial 101 we walk through manual installation and configuration on a bare metal server or on a VM. It is a very common practice to try prestodb on a laptop for quick validation and this guide, Tutorial 102, will walk through simple steps to install a three node prestodb cluster on a laptop. 

Environment

This guide was developed using a laptop running Windows with Docker installed.

Steps for Implementing Presto

Step 1: 

Create a docker network namespace, so that containers could communicate with each other using the network namespace.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahana has developed a sandbox for PrestoDB that can be downloaded from Docker Hub. Use the command below to download the PrestoDB sandbox, which comes with all the packages needed to run PrestoDB.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:

Start an instance of the PrestoDB sandbox and name it coordinator.

C:\Users\prestodb>docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

Check the cluster status on the UI. By default, the Ahana PrestoDB sandbox comes with one worker and a coordinator.

If only the coordinator needs to be running, without the worker node, then edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Now the PrestoDB UI will show an active worker count of 0.

Step 5: 

Start another instance of the PrestoDB sandbox, which will run as a worker node.

C:\Users\rupendran>docker run -d -p 8081:8081 -it --net presto_network --name workerN1 ahanaio/prestodb-sandbox
80dbb7e1d170434e06c10f9316983291c10006d53d9c6fc8dd20db60ddb4a58c

Step 6: 

Since the sandbox comes with a coordinator, it needs to be disabled for the second instance so that it runs as a worker node. To do that, click on the terminal window for the container in the Docker UI and edit the etc/config.properties file to set coordinator to false and set the HTTP port to be different from the coordinator's.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080
sh-4.2#

Step 7:

Restart the workerN1 container and check the PrestoDB UI. The active worker count will now be either 1 (if the coordinator runs without a worker node) or 2 (if the coordinator also runs a worker node).
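Assuming the worker container is named workerN1 as in Step 5, the restart can be done with:

C:\Users\rupendran>docker restart workerN1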

Step 8:

Repeat steps 5 to 7 to add a third worker node. 

  • Start a new instance of the ahanaio/prestodb-sandbox image
  • Disable the coordinator, set the port to be different from the coordinator's, and set the discovery URI to the container name of the coordinator

C:\Users\rupendran>docker run -d -p 8082:8082 -it --net presto_network --name workerN2 ahanaio/prestodb-sandbox
16eb71da54d4a9c30947970ff6da58c65bdfea9cb6ad0c76424d527720378bdd

Step 9: 

Check the cluster status; it should reflect the third worker node as part of the PrestoDB cluster.

Step 10:

Verify the PrestoDB environment by running the PrestoDB CLI with a simple TPC-H query:

sh-4.2# presto-cli
presto> SHOW SCHEMAS FROM tpch;

Schema

information_schema
sf1
sf100
sf1000
sf10000
sf100000
sf300
sf3000
sf30000
tiny
(10 rows)

Query 20210709_195712_00006_sip3d, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0.01 [10 rows, 119B] [12 row/s, 153B/s]

presto>

Summary:

PrestoDB cluster installation is simplified with the Ahana PrestoDB sandbox. It's now ready to be used for any functional validation.

Enabling spill to disk for optimal price per performance

Presto was born out of the need for low-latency interactive queries on large scale data, and hence, continually optimized for that use case. In such scenarios, the best practice is to properly size Presto Worker nodes such that all the aggregate cluster memory can fit all the data required for target data sources, queries, and concurrency level. In addition, to ensure fairness of memory across queries and prevent deadlocks, by default, Presto will kill queries that exceed configured memory limits.

The Case for Spill to Disk

As Presto usage and adoption continue to grow, it is being used for more and more different use cases. For some of these use cases, full memory bandwidth and low latency are not necessary. For example, consider long-running queries on large historical data, such as logs, where low-latency results are not paramount. In these cases, it may be acceptable, and even more optimal overall, to trade some performance for cost savings. One way to achieve this is, of course, to use lower-memory Presto Workers. However, perhaps these longer batch workloads where higher latency is tolerable are not the norm, but the minority case. Enter Presto's spill-to-disk functionality, where Presto can be configured to spill intermediate data from memory to disk when needed. While queries that spill to disk have longer execution times compared to an entirely in-memory equivalent, the query will not fail due to exceeding configured memory properties.

Cost Savings of Spill to Disk

Let's walk through a practical example of a spill-to-disk scenario. A 15-Worker Presto cluster of r5.2xlarge instances (64 GB memory, 8 vCPU, $0.5 per hour) in AWS costs about $180 per day, with an aggregate cluster memory of close to 1 TB (960 GB actual). Instead of a 15-Worker Presto cluster, if we ran a cluster with a third fewer Presto Workers at 10 nodes, we would be decreasing the aggregate cluster memory by 320 GB. Let's say we then augment the cluster with 1 TB of disk storage (more than 3x the 320 GB reduction) spread across the remaining 10 nodes (100 GB per node) to leverage Presto disk spilling. At $0.10 per GB-month for gp2 EBS volumes, the storage cost is only about $100 per month, a small fraction of the compute savings from the removed nodes, even with the 3x factor.

Spill to Disk Configuration

There are several configuration properties that need to be set to use spill to disk and they are documented in the Presto documentation. Here is an example configuration with 50 GB of storage allocated to each Worker for spilling.

experimental.spiller-spill-path=/path/to/spill/directory
experimental.spiller-max-used-space-threshold=0.7
experimental.max-spill-per-node=50GB
experimental.query-max-spill-per-node=50GB
experimental.max-revocable-memory-per-node=50GB
  • experimental.spiller-spill-path: Directory where spilled content will be written.
  • experimental.spiller-max-used-space-threshold: If disk space usage ratio of a given spill path is above this threshold, the spill path will not be eligible for spilling.
  • experimental.max-spill-per-node: Max spill space to be used by all queries on a single node.
  • experimental.query-max-spill-per-node: Max spill space to be used by a single query on a single node.
  • experimental.max-revocable-memory-per-node: How much revocable memory any one query is allowed to use.

Conclusion

Several large scale Presto deployments take advantage of spill to disk, including Facebook. Today, the Ahana Cloud managed Presto service enables spill to disk by default and sets the per node spill to disk limit at 50 GB. We will be releasing the ability for customers to configure and tune their per node spill-to-disk size soon. Give it a try. You can sign up and start using our service today for free.

Presto substring operations: How do I get the X characters from a string of a known length?

Presto provides an overloaded substring function to extract characters from a string. We will use the string “Presto String Operations” to demonstrate the use of this function.

Extract last 7 characters:

presto> SELECT substring('Presto String Operations',-7) as result;

 result

---------

 rations

(1 row)

Query 20210706_225327_00014_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract last 10 characters:

presto> SELECT substring('Presto String Operations',-10) as result;

   result

------------

 Operations

(1 row)

Query 20210706_225431_00015_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the middle portion of the string:

presto> SELECT substring('Presto String Operations',8,6) as result;

 result

--------

 String

(1 row)

Query 20210706_225649_00020_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the beginning portion of the string:

presto> SELECT substring('Presto String Operations',1,6) as result;

 result

--------

 Presto

(1 row)

Query 20210706_225949_00021_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Presto 101 Tutorial: Installing & Configuring Presto

Installing & Configuring Presto locally

Presto Installation

Presto can be installed manually or using docker images on:

  • Single node: both the coordinator and workers run on the same machine.
  • Multiple machines: the coordinator and workers run on separate machines, depending on the workload requirements.

Installing Presto Manually

Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball will contain a single top-level directory, presto-server-0.235.1, which we will call the installation directory.

Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io

[root@prestodb_c01 ~]# curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
[root@prestodb_c01 ~]# curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar

Data Directory

Presto needs a data directory for storing logs, etc. We recommend creating a data directory outside of the installation directory, which allows it to be easily preserved when upgrading Presto.

[root@prestodb_c01 ~]# mkdir -p /var/presto/data

Configuration Settings

Create an etc directory inside the installation directory. This will hold the following configuration:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command-line options for the Java Virtual Machine
  • Config Properties: configuration for the Presto server
  • Catalog Properties: configuration for Connectors (data sources)
[root@prestodb_c01 ~]# mkdir etc

Node Properties

The node properties file, etc/node.properties contains configuration specific to each node. A node is a single installed instance of Presto on a machine. This file is typically created by the deployment system when Presto is first installed. The following is a minimal etc/node.properties:

[root@prestodb_c01 ~]# cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

The above properties are described below:

  • node.environment: The name of the environment. All Presto nodes in a cluster must have the same environment name.
  • node.id: The unique identifier for this installation of Presto. This must be unique for every node. This identifier should remain consistent across reboots or upgrades of Presto. If running multiple installations of Presto on a single machine (i.e. multiple nodes on the same machine), each installation must have a unique identifier.
  • node.data-dir: The location (filesystem path) of the data directory. Presto will store logs and other data here.

JVM configuration

The JVM config file, etc/jvm.config, contains a list of command-line options used for launching the Java Virtual Machine. The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.

The following provides a good starting point for creating etc/jvm.config:

[root@prestodb_c01 ~]# cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Because an OutOfMemoryError will typically leave the JVM in an inconsistent state, we write a heap dump (for debugging) and forcibly terminate the process when this occurs.

Config Properties

The config properties file, etc/config.properties, contains the configuration for the Presto server. Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters.

In order to set up a single machine for testing that will function as both a coordinator and a worker, set the parameters below in etc/config.properties:

[root@singlenode01 ~]# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080
  • coordinator: Allow this Presto instance to function as a coordinator (accept queries from clients and manage query execution).
  • node-scheduler.include-coordinator: Allow scheduling work on the coordinator. 
  • http-server.http.port: Specifies the port for the HTTP server. Presto uses HTTP for all communication, internal and external.
  • query.max-memory: The maximum amount of distributed memory that a query may use.
  • query.max-memory-per-node: The maximum amount of user memory that a query may use on any one machine.
  • query.max-total-memory-per-node: The maximum amount of user and system memory that a query may use on any one machine, where system memory is the memory used during execution by readers, writers, and network buffers, etc.
  • discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.
  • discovery.uri: The URI to the Discovery server. Because we have enabled the embedded version of Discovery in the Presto coordinator, this should be the URI of the Presto coordinator. Replace example.net:8080 to match the host and port of the Presto coordinator. This URI must not end in a slash.

You may also wish to set the following properties:

  • jmx.rmiregistry.port: Specifies the port for the JMX RMI registry. JMX clients should connect to this port.
  • jmx.rmiserver.port: Specifies the port for the JMX RMI server. Presto exports many metrics that are useful for monitoring via JMX.

Log Levels

The optional log levels file, etc/log.properties allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the class that uses the logger. 

[root@coordinator01 ~]# cat  etc/log.properties
com.facebook.presto=INFO

There are four levels: DEBUG, INFO, WARN and ERROR.

Catalog Properties

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. 

Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, create etc/catalog/jmx.properties with the following contents to mount the jmx connector as the jmx catalog

[root@coordinator01 ~]# mkdir etc/catalog
[root@coordinator01 ~]# echo "connector.name=jmx" >> etc/catalog/jmx.properties

Running Presto

The installation directory contains the launcher script in bin/launcher. Presto can be started as a daemon by running the following:

[root@hsrhvm01 presto-server-0.235.1]# bin/launcher start
Started as 23378
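The same launcher script also supports a few other useful commands, for example running Presto in the foreground or checking and stopping the running server:

bin/launcher run       # run in the foreground, logging to the console
bin/launcher status    # check whether Presto is running
bin/launcher stop      # gracefully stop the server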

After launching, you can find the log files in var/log:

  • launcher.log: This log is created by the launcher and is connected to the stdout and stderr streams of the server. It will contain a few log messages that occur while the server logging is being initialized and any errors or diagnostics produced by the JVM.
  • server.log: This is the main log file used by Presto. It will typically contain the relevant information if the server fails during initialization. It is automatically rotated and compressed.
  • http-request.log: This is the HTTP request log which contains every HTTP request received by the server. It is automatically rotated and compressed.

What is Spark SQL?

Spark is a general purpose computation engine for large-scale data processing. At Spark’s inception, the primary abstraction was a resilient distributed dataset (RDD), an immutable distributed collection of data. Since then, higher level abstractions—called DataFrames and Datasets—that more closely resemble classic database tables have been introduced to work with structured data. Spark SQL is the Spark module for working with these abstractions and structured data.

In addition to DataFrames and Datasets, Spark SQL also exposes SQL to interact with data stores and DataFrames/Datasets. For example, let’s say we wanted to return all records from a table called people with the basic SQL query: SELECT * FROM people. To do so with Spark SQL, we could programmatically express this in Python as follows:

people_dataframe = spark.sql("SELECT * FROM people")

spark is a SparkSession instance, the main Spark entry point for structured data abstractions, and the statement returns a Spark DataFrame. With the SQL API, you can express SQL queries and get back DataFrames, and from DataFrames you can create tables (or temporary views) on top of which you can execute SQL queries. Because the SQL language is widely known, it allows a broader range of data practitioner personas, such as SQL analysts, to perform data processing on top of Spark.
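As a minimal, self-contained sketch of this interplay (the data and names below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Build a small DataFrame and expose it to SQL as a temporary view
people_dataframe = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)], ["name", "age"]
)
people_dataframe.createOrReplaceTempView("people")

# Query the view with SQL; the result is again a DataFrame
adults = spark.sql("SELECT name FROM people WHERE age > 40")
adults.show()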

Since Spark 3.0, Spark SQL has offered experimental options to be strictly ANSI compliant instead of Hive compliant. Prior to that, Spark SQL supported both ANSI SQL and HiveQL. Please consult the official Apache Spark SQL Reference if you are interested in the specifics of supported syntax, semantics, and keywords.

Regardless of whether you express data processing directly with DataFrame/Dataset methods or SQL, Spark SQL runs the same execution engine under the hood.  Further, through Spark SQL, the structured nature of the data and processing provide additional context to Spark about the data itself—such as the column types—and the workload.  This additional context allows for additional optimization, often resulting in better performance.

While Spark SQL is a general-purpose engine, you might want to consider Presto if your target use cases are predominantly interactive, low-latency queries on structured data. We compare Spark SQL and Presto in this short article.

Hive vs Presto vs Spark

What is Apache Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive provides an SQL-like interface called HiveQL to query large dataset stored in Hadoop’s HDFS and compatible file systems such as Amazon S3.

What is Presto?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, MySQL, and other relational and non-relational databases. One can even query data from multiple data sources within a single query.

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Commonalities

  • All three projects are community-driven open-source software released under the Apache License.
  • They are distributed “Big Data” software frameworks
  • BI tools connect to them using JDBC/ODBC
  • They provide query capabilities on top of Hadoop and AWS S3
  • They have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud.

Differences

  • Function: Hive – MPP SQL engine; Presto – MPP SQL engine; Spark – general purpose execution framework.
  • Processing type: Hive – batch processing using the Apache Tez or MapReduce compute frameworks; Presto – executes queries in memory, pipelined across the network between stages, thus avoiding unnecessary I/O; Spark – optimized directed acyclic graph (DAG) execution engine that actively caches data in memory.
  • SQL support: Hive – HiveQL; Presto – ANSI SQL; Spark – Spark SQL.
  • Usage: Hive – optimized for query throughput; Presto – optimized for latency; Spark – general purpose, often used for data transformation and machine learning workloads.
  • Use cases: Hive – large data aggregations; Presto – interactive queries and quick data exploration; Spark – general purpose, often used for data transformation and machine learning workloads.

Conclusion

Choosing the appropriate SQL engine depends entirely on your requirements, but if the Presto engine is what you are looking for, we suggest you give Ahana Cloud for Presto a try.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service SQL analytics for their data analysts and scientists. Basically, we've made it really easy to harness the power of Presto without having to worry about the thousands of tuning and config parameters, adding data sources, etc.

Ahana Cloud is available in AWS. We have a free trial you can sign up for today.

How do I query a data lake with Presto?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Structured and semi-structured data can be queried by Presto, an open source SQL engine. This allows users to store data as-is, without having to first structure the data, and run different types of analytics. 

To query this data in data lakes, the following technologies are needed. 

  1. A SQL query engine – Presto was built for querying data lakes like HDFS and, increasingly, cloud object stores like AWS S3 and Google Cloud Storage.
  2. A big data catalog – there are two popular big data catalog systems, also called metastores: the Hive Metastore and the AWS Glue service.
  3. Buckets in the data lake, like AWS S3

What types of data can be queried by Presto? 

The following file formats can be queried by Presto 

  1. ORC.
  2. Parquet.
  3. Avro.
  4. RCFile.
  5. SequenceFile.
  6. JSON.
  7. Text.
  8. CSV

How does it work? 

First, data in the data lake needs to be mapped into tables and columns. This is what the Hive Metastore and AWS Glue catalogs help with. For example, if there is a CSV file, then once Presto, the Hive Metastore, and Glue are integrated, users can use Presto commands to create a schema, run a create table statement, and map the file to a table and columns.

Example:

SHOW CATALOGS; 
USE ahana_hive.default; 

CREATE TABLE user ( 
	registration_dttm 	timestamp, 
	id 					int,
	first_name 			varchar,
	last_name 			varchar,
	email 				varchar,
	gender 				varchar,
	ip_address 			varchar,
	cc 					varchar,
	country 			varchar,
	birthdate 			varchar,
	salary 				double,
	title 				varchar,
	comments 			varchar
) WITH (
   format = 'CSV',
   skip_header_line_count = 1,
   external_location = 's3a://ahana/userdata/'
);

Once the table is created and mapped to the external location, it can immediately be queried. 

Example: 

Select * from user; 

You can run these commands in Presto using the Presto-cli. More information in the docs.

If you are looking for better performance, it is recommended to convert formats like JSON and CSV into more optimized formats like Apache Parquet and Apache ORC. This will improve query performance greatly. This can also be done with Presto using the CREATE TABLE AS command; more information on this is available here.
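As a sketch, converting the CSV-backed table above into Parquet with CREATE TABLE AS might look like the following (the target table name is hypothetical):

CREATE TABLE ahana_hive.default.user_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM ahana_hive.default.user;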

Ahana Cloud makes it very easy to query a data lake with Presto. It is a managed service for Presto and also comes with a built-in Hive Metastore so that you don’t need to deploy and manage one. In addition, it can also integrate with AWS Glue. Getting started with Ahana Cloud is easy. Here’s how: https://ahana.io/docs/getting-started

Additional resources: 

Presto Docker Container 
Presto-cli Docker Container 
Presto Sandbox Docker Container 

Why am I getting a Presto EMR S3 timeout error?

If you’re using AWS EMR Presto, you can use the S3 select pushdown feature to push down compute operations (i.e. SELECT) and predicate operations (i.e. WHERE) to S3. Pushdown makes query performance much faster because it means queries will only retrieve required data from S3. It also helps in reducing the amount of data transferred between EMR Presto and S3.

If you’re using pushdown for EMR Presto and seeing a timeout error, there might be a few reasons for that. Because Presto uses EMRFS as its file system, there’s a maximum allowable number of client connections to S3 through EMRFS for Presto (500). When using S3 Select Pushdown, you bypass EMRFS when you access S3 for predicate operations so the value of hive.s3select-pushdown.max-connections is what will determine the max number of client connections allowed by worker nodes. Requests that aren’t pushed down use the value of fs.s3.maxConnections.

At this point you might see an error that says “timeout waiting for connection from pool”. That usually means you need to increase both of the values above; once you do, the timeouts should stop.
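
As a rough sketch, both settings can be raised with an EMR configuration object when the cluster is created. The classification names below follow AWS’s EMR configuration classifications for Presto’s hive.properties (presto-connector-hive) and for EMRFS (emrfs-site); the connection counts are purely illustrative:

[
  {
    "Classification": "presto-connector-hive",
    "Properties": { "hive.s3select-pushdown.max-connections": "1500" }
  },
  {
    "Classification": "emrfs-site",
    "Properties": { "fs.s3.maxConnections": "1500" }
  }
]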

Errors like these are common with Presto on EMR. EMR is complex and resource-intensive, and there’s a lot you have to understand when it comes to the specific config and tuning parameters for Hadoop. Many companies have switched from EMR Presto to Ahana Cloud, a managed service for Presto on AWS that is much easier to use. Ahana Cloud is a non-Hadoop deployment in a fully managed environment. Users see up to 23x better performance with Ahana’s built-in caching.

Check out some of the differences between Presto EMR and Ahana Cloud. If you’re using EMR Presto today, Ahana Cloud might help with some of those pain points. Additionally, Ahana is pay-as-you-go pricing and it’s easy to get started if you’re already an EMR user. 

Ahana Demonstrates Major Momentum in Customer and Community Adoption for Presto 1H 2021

The Presto company also shows significant product momentum with numerous accolades and industry recognition 

San Mateo, Calif. – June 24, 2021 — Ahana, the Presto company, today announced major momentum in customer and community adoption for the first half of the year. Ahana Cloud for Presto has seen strong adoption across many verticals in the mid-size and enterprise markets for its easy to use and high performance cloud managed service to query AWS S3 data lakes. 

“With the rapid growth of data lakes today, companies are turning to SQL query engines like Presto to get fast insights directly on their data lakes and with other data sources,” said Steven Mih, CEO of Ahana. “Presto is increasingly becoming the de facto choice for SQL queries on the data lake because of its performance and open, flexible architecture. But for most companies, leveraging Presto can be complex and resource-intensive, and that’s where Ahana can help in making it incredibly easy to get the power of Presto for your AWS S3-based open data lake analytics. The momentum we’ve seen in the Presto community coupled with Ahana customer adoption and industry accolades is a testament to how critical Presto is to unlock data insights on the now ubiquitous data lake.”

Continuing Customer Success and New Customer Wins

Ahana has continued to grow its customer base across all verticals in the mid-size and enterprise markets, including companies in the telco, FinServ, AdTech, and security industries, and today there are dozens of companies using Ahana Cloud for Presto on AWS. Recent notable additions and success stories include Securonix, Dialog, Carbon, Rev, Metropolis, Requis, and Cartona.

Earlier this year, ad tech company Carbon shared at PrestoCon why they chose Ahana Cloud for Presto to power their customer-facing dashboards and eCommerce company Cartona presented their Ahana Cloud for Presto use case. Securonix, a leading security operations and analytics company, is one of the latest companies to deploy Ahana Cloud for Presto.

Sachin Nayyer, CEO at Securonix, said at the AWS Startup Showcase featuring Ahana, “We are very excited about our partnership with Presto and Ahana because Ahana provides us the ability to cloudify Presto, in addition to being our conduit to the Presto community. We believe this is the analytics solution of the future, and with Ahana for Presto we’re able to offer our customers data that’s queryable at an extremely fast speed at very reasonable price points. That has significant benefits for our customers.”

Open Source Presto Community Momentum

The Presto open source community has also continued to grow exponentially over the course of the year. March’s PrestoCon Day was the largest Presto event to date and featured sessions from Facebook on the Presto roadmap and Twitter on the RaptorX project, plus panel discussions on the Presto ecosystem and Presto, Today and Beyond.

At Percona Live Online, the biggest open source database conference in the world, the Presto community track had hundreds of attendees and featured sessions like the Kubernetes operator for Presto, how Facebook’s usage of Presto drives innovation, and many more from presenters at Facebook, Twitter, AWS and more.

Additionally, the Docker Sandbox Container for Presto hosted by Ahana has seen hundreds of thousands of pulls over the course of the year, demonstrating significant growth in Presto usage.

Global Presto meetups have grown in size to over 1,000 members across the globe in cities like New York City, London, Bangalore, Sydney, and more, and Presto Foundation membership has grown to ten companies with new members Intel and Hewlett Packard Enterprise.

“As a member of the Presto Foundation, Intel is committed to working with the Presto open source project along with Ahana, Facebook, Uber and others to drive even more innovation and community engagement,” said Arijit Bandyopadhyay, CTO – Enterprise Analytics & AI, Head of Strategy – Enterprise & Cloud, Data Platforms Group at Intel Corporation. “We look forward to continuing to build on the fantastic momentum thus far and helping even more developers and enterprises get up and running with Presto.”

“The engagement we’re seeing within the community at meetups and events like PrestoCon Day and Percona Live coupled with the usage of the Docker container are indicative of how much the Presto project continues to grow,” said Dipti Borkar, Chairperson of the Presto Foundation Outreach Committee and Cofounder and Chief Product Officer, Ahana. “As Presto continues to be the de facto query engine for the data lake, we look forward to continuing to build a robust and vibrant community around the project and expand the use cases it supports.”

Product Accolades and Industry Recognition 

Ahana has also received numerous new editorial and industry awards in 2021.

Tweet this: @AhanaIO announces major customer and community adoption for 1H 2021 #presto #cloud #AWS #datalake https://bit.ly/2TXBUeW

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Why I’m betting on PrestoDB, and why you should too!

Presto Open Data Lake

By Dipti Borkar, Ahana Cofounder, Chief Product Officer & Chief Evangelist

I’ve been in open source software companies and communities for over 10 years now, and in the database industry my whole career. So I’ve seen my fair share of the good, the bad, and the ugly when it comes to open source projects and the communities that surround them. And I’d like to think that all this experience has led me to where I am today – cofounder of Ahana, a company that’s betting big on an open source project: PrestoDB. Let’s first talk about the problem we’re trying to solve.

The Big Shift 

Organizations have been using costly, proprietary data warehouses as the workhorse for analytics for many years. And in the last few years, we’ve seen a shift to cloud data warehouses like Snowflake, AWS Redshift, and BigQuery.

Couple that with the fact that organizations have much more data (from terabytes to tens and hundreds of terabytes to even petabytes) and more types of data (telemetry, behavioral, IoT, and event data in addition to enterprise data), and there’s even greater urgency for users not to get locked into proprietary formats and proprietary systems. 

These shifts, along with AWS commoditizing a storage layer that is ubiquitous and affordable, mean that a lot more data now lives in cloud object stores like AWS S3, Google Cloud Storage, and Azure Blob Store. 

So how do you query the data stored in these data lakes?  How do you ask questions of the data and pull the answers into reports and dashboards? 

The “SQL on S3” Problem 

  • Data lakes are only storage – there is no intelligence in the data lake 
  • If the data is structured, think in tables and columns, SQL is hands down the best way to query it – hey it’s survived 50+ years 
  • If data is semi-structured, think nested like JSON etc. SQL still can be used to query with extensions to the language 
  • But SQL on data lakes was complicated. Hadoop tried it, and we know that didn’t work. While it sort of solved the problem, trying to get 70+ different components and projects to integrate and work together turned out to be a nightmare for data platform teams. 
  • There was really no simple yet performant way of querying S3.

That is where Presto comes in. 

Presto is the best engine built to directly query open formats and data lakes. Presto replaced Hive at Facebook. Presto is the heart of the modern Open Data Lake Analytics Stack – an analytics stack that includes open source, open formats, open interfaces, and open cloud. (You can read more details about this in my Dataversity article.) 

The fact of the matter is that this problem can be solved with many different open source projects/engines. At Ahana, we chose Presto – the engine that runs at Facebook, Uber and Twitter. 

Why PrestoDB?

1. Crazy good tech 

Presto is in-memory, scalable, and built like a database. And with the new innovations being added, Presto is only becoming bigger and better. 

Like I mentioned earlier, I’ve been in open source and the database space for a long time and built multiple database engines – both structured and semi-structured (aka NoSQL). I believe that PrestoDB is the query engine most aligned with the direction the analytics market is headed. At Ahana, we’re betting on PrestoDB and have built a managed service around it that solves the problem for Open Data Lake Analytics – the SQL on S3 problem. 

No other open source project has come close. Some that come to mind –

  • Apache Drill unfortunately lost its community. It had a great start being based on the Dremel paper (published by Google) but over time didn’t get the support it should have from the vendor behind it and the community fizzled. 
  • SparkSQL (built on Apache Spark) isn’t built like a database engine; instead, it is bolted on top of a general-purpose computation engine.
  • Trino, a hard fork of Presto, is largely focused on a different problem of broad access across many different data sources versus the data lake. I fundamentally believe that all data sources are NOT equal and that data lakes will be the most important data source over time, overtaking the data warehouses. Data sources are not equal for a variety of reasons: 
    • Amount of data stored 
    • Type of information stored 
    • Type of analysis that can be supported on it 
    • Longevity of data stored
    • Cost of managing and processing the data 

Given that 80-90% of the data lives on S3, I’ve seen that 80-90% of analytics will be run on this cheapest data source. That data source is the data lake. And the need to perform a correlation across more data sources comes up only 5-10% of the time for a window of time until the data from those data sources also gets ingested into the data lake. 

As an example: MySQL is the workhorse for operational transactional systems. A complex analytical query with multi-way joins from a federated engine could bring down the operational system. And while there may be a small window of time when the data in MySQL is not available in the data lake, it will eventually be moved over to the lake, and that’s where the bulk of the analysis will happen. 

2. Vendor-neutral Open Source, not Single-vendor Open Source

On top of being a great SQL query engine, Presto is open source and, most importantly, part of the Linux Foundation – governed with transparency and neutrality. On the other hand, Trino, the hard fork of Presto, is a single-vendor project, with most of the core contributors being employees of the vendor. This is problematic for any company planning to use the project as a core component of its data infrastructure, and more so for a vendor like Ahana that needs to be able to support its customers and to contribute to and enhance the source code. 

Presto is hosted under Linux Foundation in the Presto Foundation, similar to how Kubernetes is hosted by Cloud Native Computing Foundation (CNCF) under the Linux Foundation umbrella. Per the bylaws of the Linux Foundation, Presto will always stay open source and vendor neutral. It was a very important consideration for us that we could count on the project remaining free and open forever, given so many examples where we have seen single-vendor projects changing their licenses to be more restrictive over time to meet the vendor’s commercialization needs. 

For those of you that follow open source, you most likely saw the recent story on Elastic changing its open-source license which created quite a ripple in the community. Without going into all the details, it’s clear that this move prompted a backlash from a good part of the Elastic community and its contributors. This wasn’t the first “open source” company to do this (see MongoDB, Redis, etc.), nor will it be the last. I believe that over time, users will always pick a project driven by a true gold-standard foundation like Linux Foundation or Apache Software Foundation (ASF) over a company-controlled project. Kubernetes over time won the hearts of engineers over alternatives like Apache Mesos and Docker Swarm and now has one of the biggest, most vibrant user communities in the world. I see this happening with Presto.

3. Presto runs in production @ Facebook, runs in production @ Uber & runs in production @ Twitter. 

The most innovative companies in the world run Presto in production for interactive SQL workloads. Not Apache Drill, not SparkSQL, not Apache Hive and not Trino. The data warehouse engine at Facebook is Presto. Ditto that for Uber, likewise for Twitter. 

When a technology is used at the scale these giants run at, you know you are not only getting technology created by the brightest minds but also tested at internet scale. 

The Conclusion 

And that’s why I picked Presto. That’s PrestoDB, and there’s only one Presto. 

In summary, I believe that Presto is the de facto standard for SQL analytics on data lakes. It is the heart of the modern open analytics stack and will be the foundation of the next 10 years of open data lake analytics. 

Join us on our mission to make PrestoDB the open source, de facto standard query engine for everyone.

Do I need to move my data to query it with Presto?

No, Presto queries your data in-place so you don’t need to move it. If you’re using AWS S3 for your data lake, for example, you wouldn’t need to ingest it to query as you would if you were using a data warehouse like AWS Redshift. 

To bring Presto compute to your data, you can leverage Ahana Cloud.

With Ahana Cloud, it’s very easy to leverage the power of Presto to query AWS S3. You just connect your data source to Ahana and everything continues to run in your cloud account (called in-VPC). Adding data sources to Ahana Cloud for querying takes just a click of a button. Ahana Cloud can replace Amazon Athena, EMR Presto, or a self-managed Presto deployment in AWS. It’s a managed service for Presto that takes care of all the configuration, tuning, deployment, management, attaching/detaching data sources, and more.
You can learn more about Ahana Cloud, and you can also sign up for a trial.

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

In this brief post, we’ll discuss the 5 main reasons why data platform engineers decide to move their data analytics workloads from Amazon Athena to Ahana Cloud for Presto.

While AWS Athena’s serverless architecture means users don’t need to scale, provision, or manage any servers, there are trade-offs with a serverless approach around performance, pricing, and several technical limitations.

What are AWS Athena and Ahana Cloud for Presto?

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases.

Ahana Cloud for Presto provides a fully managed Presto cloud service in AWS, with support for a wide range of native Presto connectors, IO caching, and configurations optimized for your workload.

AWS Athena is a serverless interactive query service built on Presto that developers use to query AWS S3-based data lakes and other data sources.

While there are some benefits to AWS Athena, let’s talk about why the data engineers we talk to migrate to Ahana Cloud.

1. Need for Concurrency & Partitions

AWS Athena maximum concurrency is limited to 20-25 queries depending on the region, and users must request increased quotas; some users even observe a max concurrency closer to 3. Athena users can only run up to 5 queries simultaneously for each account, and Athena restricts each account to 100 databases. Athena’s partition limit is 20K partitions per table when using the Hive catalog. These limitations pose challenges when a complex query lands ahead of more latency-sensitive workloads, like serving results to a user-facing dashboard.

Ahana Cloud, on the other hand, runs as many queries as you need, when you need them. You have full transparency into what’s going on under the hood, and you get unlimited concurrency because you can simply scale the number of distributed workers.

2. Need for Performance predictability

When using AWS Athena you don’t control the number of underlying servers that AWS allocates to Athena to run your queries. As the Athena service is shared, the performance characteristics can change frequently and substantially. One minute there may be 50 servers, the next only 10 servers.

With Ahana Cloud, because you have full control of your deployment, performance is always consistent and, in many cases, faster than Athena.

3. Need for more Data source connectors

AWS Athena doesn’t use native Presto connectors, so you’ll need to use the limited options AWS provides or build your own with the AWS Lambda service.

In Ahana Cloud, you define and manage data sources in the SaaS console and can attach or detach them from any cluster with the click of a button. Connect your existing Amazon database services like RDS / MySQL, RDS / PostgreSQL, Elasticsearch, and Amazon Redshift.

4. Need for control over the underlying engine

AWS Athena’s serverless nature may make it easy to use, but it also means users have no control over adding more sessions, resources, debugging, etc.

In Ahana Cloud, however, you control the number of Presto nodes in your deployment, and you choose the node instance types for optimum price/performance. That’s easy with the full visibility provided via dashboards on performance and query management.

5. Need for Price predictability

AWS Athena billing is per query, based on volume of data scanned, making it inefficient and expensive at scale. Because costs are hard to control and predict, it leads to “bill shock” for some users. If one query scans one terabyte, that’s $5 for a few seconds.

Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with automated cluster management, increased resilience, speed, and ease of use. Plus, Ahana offers pay-as-you-go pricing – you only pay for what you use. Using the same example, $5 lets you run a 6-node cluster of r5.xlarge instances for an hour, or hundreds of queries instead of just one.

Summary

AWS Athena’s serverless architecture makes it really easy to get started. However, the service has many limitations that can cause problems, and many data engineering teams have spent hours trying to diagnose them. Due to these limitations, AWS Athena can run slowly and increase operational costs.

Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists. And all this without the limits of AWS Athena.

Ahana Cloud is available in AWS. You can sign up and start using our service today for free.

Ahana Cloud for Presto Versus Amazon EMR

In this brief post, we’ll discuss some of the benefits of Ahana Cloud over Amazon Elastic MapReduce (EMR). While EMR offers optionality in the number of big data compute frameworks, that flexibility comes with operational and configuration burden. When it comes to low-latency interactive querying on big data that just works, Ahana Cloud for Presto offers much lower operational burden and Presto-specific optimizations.

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases. In fact, you can run federated queries across your data sources. Developed at Facebook, Presto is supported by the Presto Foundation, an independent nonprofit organization under the auspices of the Linux Foundation. Presto is used by leading technology companies, such as Facebook, Twitter, Uber, and Netflix.

Amazon EMR is a big data platform hosted in AWS. EMR allows you to provision a cluster with one or more big data technologies, such as Hadoop, Apache Spark, Apache Hive, and Presto. Ahana Cloud for Presto is the easiest cloud-native managed service for Presto, empowering data teams of all sizes. As a focused Presto solution, here are a few of Ahana Cloud’s benefits over Amazon EMR:

Less configuration. Born of the Hadoop era, Presto has many configuration parameters spread across several files that must be tuned to get right. With EMR, you have to configure these yourself. With Ahana Cloud, we tune more than 200 parameters out of the box, so when you spin up a cluster, you get excellent query performance from the get-go. Out of the box, Ahana Cloud also provides an Apache Superset sandbox for administrators to validate connecting to, querying, and visualizing your data.

Easy-to-modify configuration. Ahana Cloud offers the ability not only to spin up and terminate clusters, but also to stop and restart them, allowing you to change the number of Presto workers and add or remove data sources. With EMR, any manual changes to the number of Presto workers and data sources require a new cluster or manually restarting the services yourself. Further, adding and removing data sources is done through a convenient user interface instead of by modifying low-level configuration files.

Optimizations. As a Presto managed service, Ahana Cloud will continually provide optimizations relevant to Presto. For example, Ahana recently released data lake I/O caching. Based on the RubiX open source project and enabled with a single click, the caching eliminates redundant reads from your data lake when the same data is read over and over. This results in up to 5x query performance improvement and up to 85% latency reduction for concurrent workloads. Finally, idle clusters processing no queries can automatically scale down to a single Presto worker to save costs while allowing for a quick warm-up.

If you are experienced at tuning Presto and want full control of the infrastructure management, Amazon EMR may be the choice for you. If simplicity and accelerated go-to-market without needing to manage a complex infrastructure are what you seek, then Ahana Cloud for Presto is the way to go. Sign up for our free trial today.

Streaming Data Processing Using Apache Kafka and Presto

Kafka Quick Start

Kafka is a distributed data streaming framework meant to enable the creation of highly scalable distributed systems. Developed at LinkedIn in 2008 and open-sourced in 2011, it enables building decoupled yet conceptually connected systems. Broken down to the simplest level, Kafka provides a consistent, fast, and highly scalable commit log: all writes are guaranteed to be ordered, and entries cannot be deleted or modified.

Once entries are added to the log, different systems can then process the entries, communicating with each other as needed, most likely by adding entries to the Kafka commit log. This enables the creation of software as a system of systems. Communication and processing happen in parallel and asynchronously, enabling each system to be developed, maintained, scaled, and enhanced as needed. Some of the companies using Kafka include Coursera, Netflix, Spotify, Activision, Uber, and Slack. 

Inner Workings Of Kafka

Kafka consists of producers that send messages to a Kafka node. These messages are grouped by topics to which consumers are subscribed. Each consumer receives all the messages sent to the topics it is subscribed to and carries out further processing as required. All the messages sent to the broker are stored for a given time or until they reach a given size on disk. Deployment is done in a cluster consisting of several brokers to ensure there is no single point of failure.

Messages sent to topics are split into partitions that are replicated across several nodes. The replication factor is determined by the performance and resilience requirements of the data and system. At any moment, one Kafka broker acts as the partition leader that owns the partition; it is the node to which producers write their messages and from which consumers read them.

What is Presto?

Presto is a distributed query engine that allows the use of ANSI SQL to query data from multiple data sources. It holds processing and query results in memory, making it extremely efficient and fast. A Presto cluster consists of a coordinator node and multiple worker nodes. The worker nodes are responsible for connecting to data stores via plugins/connectors and for query processing.

Distributed Data Processing Using Kafka and Presto

Kafka and Presto are normally combined with Kafka providing real-time data pipelines and Presto providing distributed querying. This is easily achieved using the Presto Kafka connector, which provides access to Kafka topics. It is also possible to have Presto act as a producer, sending messages to Kafka that are then processed by other applications like business intelligence (BI) and machine learning (ML) systems.

To connect Presto and Kafka, you need to have the Kafka cluster running. You then add a catalog file with the connector.name value set to kafka, the kafka.table-names property listing the topics from the cluster, and the kafka.nodes property containing the node(s). If multiple Kafka clusters are available, each can be connected to Presto by adding a uniquely named catalog file per cluster. An example catalog file is sketched below.
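
A minimal sketch of such a catalog file (for example etc/catalog/kafka.properties; the topic and broker names are illustrative):

connector.name=kafka
kafka.table-names=orders,users
kafka.nodes=kafka-broker-1:9092,kafka-broker-2:9092

After restarting Presto, the listed topics show up as tables in the kafka catalog and can be queried with regular SQL.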

Get Started with Presto & Apache Kafka

Business Intelligence And Data Analysis With Druid and Presto

Apache Druid Helicopter View

Apache Druid is a distributed, columnar database aimed at developing analytical solutions. It offers a real-time analytics database able to ingest and query massive amounts of data quickly and store the data safely. It was developed by Metamarkets in 2011, open-sourced in 2012, and made an Apache project in 2015. Some of the companies using Druid include Paypal, Cisco, British Telecom (BT), Reddit, Salesforce, Splunk, Unity, and Verizon.

Druid incorporates ideas from data warehousing, cloud computing, and distributed systems. Its architecture provides many characteristics and features that make it a top candidate for an enterprise, real-time data analysis datastore. Druid runs on a cluster of multiple nodes, offering high scalability, concurrency, availability, and fault tolerance.

The Apache Druid Cluster Architecture

Druid nodes are of various types, each serving a specialized function. Realtime nodes read and index streaming data, creating segments that are stored until forwarded to historical nodes. Historical nodes store and read immutable data segments in deep storage like S3 and HDFS. Coordinator nodes handle data management features like segment-to-historical-node assignment, load balancing, and replication.

Overlord nodes handle a Druid cluster’s task and data ingestion management. They assign tasks to Middle Manager nodes, which process the tasks and provide features like real-time indexing. Broker nodes provide an interface between the cluster and clients: they accept queries, send them to the appropriate real-time/historical nodes, accept the query results, and return the final results to the client. Druid also has optional Router nodes that proxy request management to Overlord and Coordinator nodes and route queries to Broker nodes.

What is Presto?

Presto is an open source SQL query engine built for data lake analytics and ad hoc querying. It was developed to meet Facebook’s OLAP needs against its Hive data lake. Its design goals include fast and parallel query processing, creating a virtual data warehouse from disparate datastores via a plugin architecture, and providing a highly scalable and distributed query engine. 

Presto is deployed in production as a cluster of nodes to improve performance and scalability.

Druid and Presto Data Analysis Application

Druid and Presto are usually combined to create highly scalable, parallel, distributed real-time analytics, business intelligence (BI), and online analytical processing (OLAP) solutions. Since both platforms are open source, users can enjoy their power without purchase or licensing costs if they’re comfortable managing both on their own. Having Druid process real-time data and handle ad hoc querying enables real-time analytics on a Presto-powered stack. Presto allows users to perform join queries across disparate data sources, so they can select the datastore that best meets their diverse needs, e.g. online transaction processing (OLTP) databases like MySQL, document-oriented databases like MongoDB, and/or geospatial databases like PostGIS.

Integrating Druid with Presto is done via the Presto Druid connector. This requires the creation of a catalog properties file that configures the connection. The first property is connector.name, which needs to be set to druid. The druid.broker-url and druid.coordinator-url properties accept the URLs of the broker and coordinator, respectively, in hostname:port format. Query pushdown is enabled by setting druid.compute-pushdown-enabled to true. A sketch of such a file is shown below.
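
A minimal sketch of a Druid catalog file (the hostnames are illustrative; 8082 and 8081 are the default Druid broker and coordinator ports):

connector.name=druid
druid.broker-url=druid-broker:8082
druid.coordinator-url=druid-coordinator:8081
druid.compute-pushdown-enabled=true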

Get Started with Presto & Druid

Flexible And Low Latency OLAP Using Apache Pinot and Presto for real time analytics

Apache Pinot Overview

Apache Pinot is a distributed, low latency online analytical processing (OLAP) platform used for carrying out fast big data analytics. Developed at LinkedIn in 2014, the highly scalable platform is meant to power time-sensitive analytics and is designed to have low latency and high throughput. It was open-sourced in 2015 and incubated by the Apache Software Foundation in 2018. Some of its use cases include high dimensional data analysis, business intelligence (BI), and providing users with profile view metrics. Other companies using Pinot include Uber, Microsoft, Target, Stripe, and Walmart.

Simplified View Of How Apache Pinot Works

Pinot is meant to be highly scalable and distributed while providing high throughput and fast turnaround time. To achieve this, related data from streaming sources like Kafka and data lakes like S3 are stored in tables. The tables are split into segments that are sets containing non-changing tuples. Segments are stored in a columnar manner and additionally contain metadata, zone maps, and indices related to contained tuples. Segments are stored and replicated among Pinot server nodes. Controller nodes contain global metadata related to all segments in a cluster like server node to segment mapping.

Pinot consists of four main components, namely brokers, servers, minions, and controllers. The controller handles cluster management, scheduling, and resource allocation, and exposes a REST API for administration. The Pinot broker is responsible for receiving client queries, sending them to servers for execution, and returning the results to the client. Servers hold the segments that store data and handle most of the distributed processing; they are divided into offline servers, which typically contain immutable segments, and real-time servers, which ingest data from streaming sources. Minions are used for maintenance tasks not related to query processing, like periodically purging data from a Pinot cluster for security and regulatory compliance reasons.

What is Presto?

Presto is a fast query engine able to handle processing in a parallel and distributed manner. It’s an open source, distributed SQL query engine.

Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Fast and Flexible OLAP With Pinot and Presto

When carrying out analytics, system designers and developers normally have to make a tradeoff between querying flexibility and fast response times: the more flexible a system is, the slower its response time. Pinot is extremely fast but has limited flexibility, while Presto is a bit slower but offers more flexibility. Having a Pinot cluster as the storage layer and a Presto cluster as the querying layer provides users with high-throughput, low-latency storage and powerful, flexible querying. Integration is achieved using the open source Presto Pinot connector, which is responsible for managing connections and mapping queries and their results between the two platforms. Optimization is achieved by pushing queries down to Pinot, with Presto offering features lacking in Pinot, such as table joins. A sketch of the catalog configuration is shown below.
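
As a rough sketch, the Pinot catalog file typically only needs the connector name and the controller address. The pinot.controller-urls property name and the hostname below are assumptions based on the connector’s documentation and may vary by Presto version:

connector.name=pinot
pinot.controller-urls=pinot-controller:9000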

You can learn more about the Apache Pinot connector for Presto in the PrestoCon session presented by the Apache Pinot team.

Get Started with Apache Pinot & Presto

Turbocharge your Analytics with MongoDB And Presto

High-Level View Of MongoDB

MongoDB is a NoSQL distributed document database meant to handle diverse data management requirements. Its design goals include creating an object-oriented, highly available, scalable, and efficient database with ACID (Atomicity, Consistency, Isolation, and Durability) guarantees. Its document model enables data to be stored in its most natural form, as opposed to the relational model, making users more productive. It supports both schemaless and schema-based design, offering flexibility as well as data integrity and consistency enforcement as needed. Some of the organizations using MongoDB include Google, SAP, Verizon, Intuit, Sega, Adobe, InVision, and EA Sports.

A Look At MongoDB Architecture

MongoDB stores data in documents in the Binary JSON (BSON) format. Logically related documents are grouped into collections, which are indexed. MongoDB servers that store data form shards, which are grouped into replica sets. Replica sets have the same data replicated among them, with the default replication factor being 3 servers. Data is partitioned into chunks, which, combined with sharding and replication, provides high reliability and availability. During partitioning, consistency is ensured by making the database unavailable for writes. Config servers hold configuration data and metadata related to the MongoDB clusters. Mongo’s routers accept queries, direct them to the correct shards, and return results to clients. 

MongoDB Deployment

MongoDB is cross-platform and can be installed on all major operating systems. It can either be installed manually, deployed on private and/or public clouds, or accessed via premium cloud offerings. Recommended practice in production is to have multiple nodes running MongoDB instances, forming a cluster.

What is Presto?

Presto is an open source SQL query engine that provides a scalable and high throughput query engine capable of accessing different data stores including MySQL, DB2, Oracle, Cassandra, Redis, S3, and MongoDB. This enables the creation of a virtualized data lake of all data. Combining Presto with MongoDB creates a highly scalable and cohesive yet loosely decoupled data management stack.

Scalable Analytics With MongoDB and Presto

Combining MongoDB and Presto provides a highly scalable tech stack for developing distributed analytical applications. MongoDB is an enterprise distributed database capable of storing data as strictly as users need it to be while ensuring high horizontal scalability, availability, resilience, and self-healing. Designers and developers can choose the data model that best serves them, trading flexibility for strictness in the schema design and performance for transactional integrity in write operations. Different clusters can be created as needed to meet different performance and functional goals.

For example, writes can be unacknowledged, acknowledged, or replica-acknowledged, with faster writes being achieved with weaker write enforcement. Reads can be performed from secondary, primary-preferred, and primary nodes for a tradeoff between turnaround times and fetching stale data. This makes it a great storage layer for OLAP systems. Data can be persisted as safely, or read as quickly, as each application needs. Integration is achieved using the Presto MongoDB connector.
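
As a rough sketch, the catalog file for the MongoDB connector mainly needs the seed hosts of the cluster; the mongodb.seeds property name comes from the connector’s documentation, and the hostnames below are illustrative:

connector.name=mongodb
mongodb.seeds=mongodb-host-1:27017,mongodb-host-2:27017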

Can you insert a JSON document into MongoDB with Presto?

This question comes up quite a bit. In short, yes, you can. You’d run an INSERT statement from Presto against a table backed by the MongoDB connector. For example:

INSERT INTO orders VALUES(1, 'bad', 50.0, current_date);

That insert would go into MongoDB as a JSON document.

Getting started with Presto in the cloud

If you want to get started with Presto quickly in the cloud, try out Ahana Cloud for free. Ahana takes care of the deployment, management, adding/detaching data sources, etc. for you. It’s a managed service for Presto that makes it really easy to get started. You can try it free at https://ahana.io/sign-up 

CRN® Recognizes Ahana on Its 2021 Big Data 100 List As One of The Coolest Business Analytics Companies

CRN logo

Ahana also named to CRN’s 10 Hot Big Data Companies You Should Watch in 2021 list

San Mateo, Calif. – May 5, 2021 — Ahana, the self-service analytics company for Presto, announced today that CRN®, a brand of The Channel Company®, recognized Ahana on its 2021 Big Data 100 list as one of the Coolest Business Analytics Companies. This annual list recognizes the technology vendors that go above and beyond by delivering innovation-driven products and services for solution providers that in turn help enterprise organizations better manage and utilize the massive amounts of business data they generate.

This recognition follows Ahana’s recent distinction by CRN as one of 10 Hot Big Data Companies You Should Watch in 2021. “We are honored to receive these prestigious accolades from one of the industry’s most influential media sources,” said Steven Mih, Cofounder and CEO, Ahana. “This is another validation of tremendous growth in users of the open source Presto project and the innovation of Ahana Cloud for Presto, which brings the power of the most powerful open source distributed SQL query engine to any organization.”

Ahana Cloud for Presto is the first and only cloud-native managed service for Presto on Amazon Web Services (AWS), giving customers complete control and visibility of clusters and their data. Presto is an open source distributed SQL query engine for data analytics. With Ahana Cloud, the power of Presto is now accessible to any data team of any size and skill level.

A team of CRN editors compiled this year’s Big Data 100 list by identifying IT vendors that have consistently made technical innovation a top priority through their offering of products and services for business analytics, systems and platforms, big data management and integration tools, database systems, and data science and machine learning. Over the years, the Big Data 100 list has become an invaluable resource for solution providers that trust CRN to help them find vendors that specialize in data intelligence, insights, and analytics.

“IT vendors featured on CRN’s 2021 Big Data 100 list have demonstrated a proven ability to bring much-needed innovation, insight and industry expertise to the solution providers and customers that need it most,” said Blaine Raddon, CEO of The Channel Company. “I am honored to recognize these companies for their unceasing commitment toward elevating and improving the ways businesses gain value from their data.”

The 2021 Big Data 100 list is available online at https://www.crn.com/news/cloud/the-big-data-100-2021

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

About The Channel Company®

The Channel Company enables breakthrough IT channel performance with our dominant media, engaging events, expert consulting and education, and innovative marketing services and platforms. As the channel catalyst, we connect and empower technology suppliers, solution providers, and end users. Backed by more than 30 years of unequaled channel experience, we draw from our deep knowledge to envision innovative new solutions for ever-evolving challenges in the technology marketplace. www.thechannelcompany.com

Follow The Channel Company®: Twitter, LinkedIn, and Facebook

© 2021 The Channel Company, LLC. CRN is a registered trademark of The Channel Company, LLC. All rights reserved.

The Channel Company Contact:

Jennifer Hogan

The Channel Company

jhogan@thechannelcompany.com

Ahana Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

How do I sync my partition and metastore in Presto?

The sync_partition_metadata procedure is used to sync the metastore with the partition information on the file system (e.g. S3) for an external table. Depending on the number of partitions, the sync can take some time.

Here is a quick reference from the presto docs: https://prestodb.io/docs/current/connector/hive.html?highlight=sync_partition_metadata

Procedures:

  • system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)
    Create an empty partition in the specified table.
  • system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)
    Check and update partitions list in metastore. There are three modes available:
    • ADD : add any partitions that exist on the file system but not in the metastore.
    • DROP: drop any partitions that exist in the metastore but not on the file system.
    • FULL: perform both ADD and DROP.

The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (e.g. col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
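
As a minimal sketch (the catalog, schema, and table names are illustrative), syncing all partitions for an external table from the Presto CLI looks like this:

CALL ahana_hive.system.sync_partition_metadata('default', 'user', 'FULL');

This adds any partitions found on S3 that are missing from the metastore and drops any metastore partitions that no longer exist on S3.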

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.