Joining forces to further Presto together: Ahana acquired by IBM
Today I am thrilled to finally share more about Ahana joining forces with IBM. A few weeks ago, we shared with the Linux Foundation Presto open source community that IBM would be joining the Presto Foundation through the acquisition of Ahana. This acquisition is a significant milestone for us and marks the beginning of an exciting new chapter in Ahana’s journey.
Three years ago I co-founded this company with the belief that Presto was the best engine for SQL-based data lake analytics. And I couldn't be prouder of what we accomplished over those three years. We've built an incredible product in Ahana Cloud, the managed service for Presto, hired an extraordinary team that continues to drive our vision of Presto as the SQL engine for the open data lakehouse, and helped grow the community and drive the technical roadmap for the open source Presto project.
Over the last few months, I spent a lot of time with the IBM team. As a market leader in data and AI (see The Forrester Wave: Data Management for Analytics), IBM is at the forefront of the data and analytics market and helps many customers build out their data & analytics platforms. I learned we shared a vision on the future of data and analytics and why the Open Data Lakehouse will play a crucial role.
And perhaps most importantly, we also both firmly believe in the open governance model of the Presto open source project. IBM has a rich history of open source contributions, which became widely known with its acquisition of Red Hat in 2018. But even before that, IBM has worked closely with Linux, Apache and Eclipse since the late 90s and is now a prominent contributor to many open source communities like Spark, Docker, CloudFoundry, Kubernetes, OpenJS, and more. They were a founding member of Linux Foundation’s Cloud Native Computing Foundation (CNCF), which fostered the growth of Kubernetes.
In fact, one of the big reasons IBM chose to support Presto was because it’s a Linux Foundation project, neutrally governed across a consortium of industry leaders like Meta, Uber, Intel, HPE and more, unlike other projects that are vendor-led and not as open and transparent. We invite everyone to join, as we know that openly governed, community-driven OSS projects will continue to be on the right side of history.
Together we believe it's 1 + 1 = 3: together, we will make Presto better than it has ever been. Combining our technical expertise around the common vision of the Open Data Lakehouse enables us to contribute innovations to the open source Presto project faster than ever before. And we believe this is good for the developer and user community!
Today, IBM announced a new AI and data platform called watsonx to help organizations build and scale AI. Part of that platform is a data store built on an open data lakehouse architecture called watsonx.data that enables enterprises to easily unify and govern their structured and unstructured data, wherever it resides, for high performance AI and analytics. At the heart of watsonx.data is Presto, and I couldn’t be more excited.
As an IBM company, our mission remains to make Presto the best engine for the Open Data Lakehouse. We’ll continue to be heavily involved in the Presto Foundation to help drive the Presto open source technical roadmap as well as community programs like PrestoCon.
And now, we can accelerate our data lakehouse go-to-market to enterprise customers and focus on building out hybrid cloud and multi-cloud deployments. The opportunity is massive, and we are so excited to grow Presto bigger than it's ever been!
I want to assure you that our commitment to our vision remains unchanged. We are excited to now be a part of IBM to create even more value for customers and community members.
Thank you to all of our customers, partners, and community stakeholders. Thank you to the Meta team Victoria Dudin, Sara Narayan, and all of Meta’s Presto and Velox technical members including Tim Meehan, James Sun, Wenlei Xie, Orri Erling, Pedro Pedreira, Rebecca Schlussel, Masha Basmanova, and Biswapesh Chattopadhya; community developers Zhenxiao Luo, Rongrong Zhong, Beinan Wang, Bin Fan, Chunxu Tang, and James Petty; Arijit Bandyopadhyay, Susan Marquez, and Dave Cohen at Intel and their Intel ISV Disrupter/Accelerator Program; Girish Baliga at Uber; Michael Dolan, Chris Aniszczyk, and Michelle Martineau at Linux Foundation; and Vinoth Chandar and the Apache HUDI team at Onehouse. We could not have achieved this level of success without you. We are excited to continue working with you and to provide you with even more value in the future.
Thank you to our investors, starting with Dave Munichiello at GV for the first check and being instrumental in guiding the early days. Mark Leslie of Leslie Ventures for providing timely, invaluable board guidance. A seed+ lead check and advice from Brandon Reeves of Lux Capital. Rob Schwartz and Dan Moskowitz of Third Point for being the Series A lead and providing insightful board guidance as well. Ankur Prakash of Liberty Global Ventures for believing in me again, leading an extension. And investor and advisor, Bob Wiederhold, whose open source and startup expertise has been incredibly helpful.
And finally, thank you to all of the Ahana team, including my co-founders Dave Simmen, Dipti Borkar, Ashish Tadose, Vivek Bharathan, and George Wang. We accomplished a lot over the last three years, and I’m incredibly humbled and proud to have worked with each and every one of you.
We are confident that this acquisition will be a positive step forward for Presto. We look forward to sharing more with you as we embark on this exciting new chapter.
Thank you for being a part of this journey with us. If you haven’t yet, I encourage you to join the Presto community, the best place to meet others is in the Presto community Slack.
We can’t wait to see what the future holds!
Steven
CEO & Co-Founder, Ahana
The top 5 reasons a data engineer should use Presto
Open source software has become very widely used over the last decade or so. It's not just developers and engineers adopting open source – companies ranging from the Fortune 500 to 10-person startups are relying on open source software for many different parts of their business. In fact, as Ted Dunning recently put it, “every company in the world that uses software uses open source software”.
In this blog, I’ll share more about the open source SQL engine Presto and why thousands of engineers and companies use Presto for ad hoc and interactive analytics. In fact, I’ll give you five reasons why you should use Presto in your data platform.
At a high level, the Presto engine was built to process data in memory, eliminating the need for disk I/O between processing stages. And because Presto is largely agnostic to the underlying storage, you can quickly join or aggregate datasets across a range of data sources for a unified view of your data to query against.
Why does this matter? If you can read data more swiftly, the performance of your queries improves correspondingly—always a good thing when you have business analysts, executives, and customer reports that need to be made available regularly.
Let’s look at 5 reasons why you should consider Presto for your data platform.
1. Query Federation
This architecture depicts a federated query across multiple data sources with Presto
Presto provides a single unified SQL dialect that abstracts all supported data sources. This is a powerful feature which eliminates the need for users to understand connections and SQL dialects of underlying systems.
With Presto, we can write queries that join multiple disparate data sources without moving the data – it’s all queried by unified ANSI SQL.
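For illustration, here is a minimal sketch of such a federated query. The catalog, schema, and table names (a mysql catalog with an orders table and a hive catalog with a customers table on S3) are hypothetical:

select c.customer_name, sum(o.order_total) as total_spend
from mysql.sales.orders o
join hive.crm.customers c
  on o.customer_id = c.customer_id
group by c.customer_name
order by total_spend desc;

Presto resolves each table through its connector, so the join runs in one engine even though the data lives in two different systems.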
2. Fast, Reliable & Efficient
Data infrastructure costs can explode, especially with proprietary systems like data warehouses (e.g., Snowflake or Redshift), as data sizes and users' workloads grow.
Presto is battle-tested at Meta and Uber and can scale to meet growing data sizes and workloads. It’s faster and more efficient than other engines because it’s optimized for large numbers of small queries, so you can query data at better price-performance compared to proprietary systems.
3. Presto is not tied to your storage
Because Presto’s design separates the query engine layer from your storage layer (HDFS, MySQL, S3, etc.), you can independently scale either layer depending on what your workload needs are. Storage and compute are not tied to one another. Presto gives data engineers and architects tremendous flexibility.
4. Unified SQL interface
SQL is the oldest and the most widely-used language for data analysis. Analysts, data engineers and data scientists use SQL for exploring data, building dashboards, and testing hypotheses with notebooks like Jupyter and Zeppelin, or with BI tools like Tableau, PowerBI, and Looker.
Presto can query data not just from distributed file systems, but also from other sources such as NoSQL stores like Cassandra and Elasticsearch, relational databases, and even message queues like Kafka, and all of these data sources are queryable with standard ANSI SQL.
5. Open Source
Presto is open source. It's part of the Linux Foundation, a gold standard for open source projects. The project is neutrally governed, meaning no one company or individual can dictate or control the roadmap. In fact, the Presto Foundation is made up of a consortium of industry leaders (Meta, Uber, Alibaba, Twitter, Intel, HPE and many more) who work together to advance the project. For many, an important factor in the decision to use open source is ensuring that a project is not solely corporate-backed and is truly open to the broader community.
Presto: The Open Source SQL Query engine for better insights
We've talked about how Presto enables self-service ad hoc analytics for users handling large amounts of data, how you can query data where it lives, how storage and compute scale independently, and its unified SQL interface. These are the reasons a data engineer should consider using Presto for their data architecture.
At Ahana, we've built a managed service for Presto on AWS. As you can tell, Presto is very powerful, but it can also be complex to manage. We abstract those complexities and take care of configuring, tuning, and managing Presto under the hood so you can focus on driving data analytics for your organization. We have a 14-day free trial if you want to check it out, get started at https://ahana.io/get-started/ or check out a demo at https://ahana.io/ahana-cloud-demo-registration/
How to Cut the Cost of Your Lakehouse Solution
Lakehouse solutions are becoming more and more popular as an augmentation or replacement for expensive, locked-in data warehouses. However, many organizations still struggle with the cost of these implementations. Let's discuss how to reduce the cost of your lakehouse solution: we'll look at the biggest cost drivers and how open source can help mitigate them.
Leveraging an open data lakehouse offers countless advantages, from its distinct compute and storage architecture to a lack of vendor lock-in. Plus, you gain the freedom to opt for whichever engine is best suited for your needs and cut costs along the way!
As separated storage has become more affordable and widely available, the compute engine has become the major cost driver for data lakehouses. When building a data lakehouse, storage, the metadata catalog, and table/data management are not the components that drive significant cost increases. Compute, however, is: as the number of jobs and queries grows, so does the hardware and software needed to run them, and costs rise significantly.
Fortunately, most distributed compute engines, like Presto, are available as open source software that can be used for free; all you pay for are the servers or cloud instances. Although compute engines share similar functionality, some have a more optimized design thanks to their underlying technology. These are far more efficient than others, resulting in significant cost savings because fewer servers are required.
The open source Presto engine is very efficient and is becoming more so as compute workers adopt native C++ vectorization technology. Compared with systems that run on a Java virtual machine, native C++ code is drastically more efficient: C++ compiles to native machine code and gives developers fine-grained control over memory allocation, while the JVM adds runtime overhead and is susceptible to its infamous garbage collection storms, an issue C++ does not suffer from. A good illustration of this shift is Apache Spark SQL, which runs on the JVM, versus Databricks' recently introduced proprietary Photon engine, which is written in C++.
Running an AWS lakehouse with Presto can potentially reduce your compute cost by two-thirds. Let's take a look at a sample comparison of running an AWS lakehouse with another solution vs. with Presto. Consider a 200TB lakehouse with 20 nodes of Presto, using current AWS pricing (December 2022): 20 x r5.8xl instances = $40/hour.
If used for 30 days, the compute would be about $29K per month (720 hours x $40/hour).
200TB of S3 per month = $4K per month.
Setting aside data transfer charges, you'll be spending 88% of the total on compute.
So if you have a compute engine that is 3 times more efficient, you would end up with 1/3 the compute nodes for the same workload:
7 X r5.8xl instances = $14/hour
If used for 30 days, the compute would be $10K per month.
200TB of S3 per month = $4K per month.
This comparison does not account for data transfer or metadata fees, which are negligible.
So the comparison comes out to $33K vs. $14K per month.
That works out to cost savings on the order of 60%.
When you are looking for a lakehouse, do you think about getting an all-in-one solution from one vendor? If so, you may be missing out on a great opportunity. By unbundling the lakehouse and using open components that work well together, you can get all of the benefits of owning a lakehouse without breaking the bank. One of the key components of this strategy is the Presto SQL query engine. Let's take a closer look at what Presto can do for you!
With the rising popularity of data lakehouses, businesses should consider unbundling the ecosystem for greater efficiency and cost savings. A main advantage of a data lakehouse is its capability to process a variety of compute workloads as your organization's needs evolve over time. Computing workloads can be divided into SQL queries or non-SQL based code, which could be used for Machine Learning training or data manipulation. Most firms realize that SQL is an ideal tool to help their analysts and developers explore data more effectively. Oftentimes, they begin with the introduction of a well-functioning SQL platform rather than other advanced workloads like ML training.
By leveraging open-source Presto, organizations are able to create a SQL lakehouse and provide fast, reliable SQL queries. These Presto-based systems are often more cost-effective than bundled, single-vendor options, which tend to have closed-source components, like a proprietary engine and metadata catalog.
Unbundling your lakehouse offers a number of distinct benefits. For the SQL lakehouse example, Linux Foundation Presto can be used as the powerful open source query engine, Apache Hudi for the table format, and Apache Hive Metastore for the catalog, all open components without lock-in. Additionally, unbundling gives platform engineers the opportunity to opt for other quickly evolving open source projects and find the most cost-effective platform at any given time. Therefore, unbundling can provide unprecedented levels of functionality, scalability, flexibility, and performance at a lower cost compared to traditional single-vendor lakehouse offerings.
Presto is a distributed query engine designed to enable fast, interactive analytics on all your data. Presto is an open-source system and has several key advantages over closed-source offerings: performance, scalability, and reliability. Use Presto for your unbundled SQL Lakehouse.
AWS Athena Alternatives: Best Amazon Athena Alternatives
Looking for Athena alternatives? Ahana gives you the scale and performance of PrestoDB – the same underlying technology that powers Athena, and which is used for petabyte-scale analytics at Meta and Uber – with none of the limitations. Get better price/performance and regain control over your cloud infrastructure with Ahana’s managed Presto solution for AWS. Request a demo today!
This is the 4th blog in our comparing AWS Athena to PrestoDB series. If you missed the others, you can find them here:
If you’re looking for Amazon Athena alternatives, you’ve come to the right place. In this blog post, we’ll explore some of the best AWS Athena alternatives out there.
Athena is a great tool for querying data stored in S3 – typically in a data lake or data lakehouse architecture – but it's not the only option out there. There are a number of other alternatives you might want to consider, including managed Presto services such as Ahana, self-managed open-source Presto, as well as cloud data warehouses.
Each of these tools has its own strengths and weaknesses, and really the best choice depends on the data you have and what you want to do with it. In this blog post, we’ll compare Athena with each of these other options to help you make the best decision for your data.
What is AWS Athena?
AWS Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. Amazon Athena is great for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse. Athena (engine 2) also provides federated query capabilities, which allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources.
Why would I not want to use AWS Athena?
There are various reasons users look for alternative options to Athena, in spite of its advantages:
Performance consistency: Athena is a shared, serverless, multi-tenant service deployed per region. If too many users leverage the service at the same time in a region, users across the board start seeing query queuing and latencies. Query concurrency can be challenging due to limits imposed on accounts to prevent users from overwhelming the regional service.
Cost per query: Athena charges based on the amount of data scanned ($5 per TB). If your datasets are not very large and you don't have a lot of users querying the data often, Athena is a perfect solution for your needs. If, however, you run hundreds or thousands of queries that scan terabytes or petabytes of data, Athena may not be the most cost-effective choice; for example, 1,000 queries that each scan 1 TB cost $5,000.
Visibility and Control: There are no knobs to tweak in terms of capacity, performance, CPU, or priority for the queries. You have no visibility into the underlying infrastructure or even into the details as to why the query failed or how it’s performing. This visibility is important from a query tuning and consistency standpoint and even to reduce the amount of data scanned in a query.
Security: In spite of having access controls via IAM and other AWS security measures, some customers simply want better control over the querying infrastructure and choose to deploy a solution that provides better manageability, visibility, and control.
Feature delays: Presto is evolving at a rapid pace, with new performance features, SQL functions, and optimizations contributed periodically by the community as well as companies such as Facebook, Alibaba, Uber, and others. Amazon only caught up to version 0.217 in November 2020. With the current version of PrestoDB being 0.248, if you need the performance, features, and efficiencies that newer versions provide, you are going to have to wait for some time.
What are the typical alternatives to AWS Athena?
DIY open-source PrestoDB
Managed Hadoop and Presto
Managed Presto Service
Cloud data warehouse such as Redshift or Snowflake
Depending upon a user's business needs and the level of control desired, users leverage one or more of the following options:
DIY open-source PrestoDB
Instead of using Athena, users deploy open-source PrestoDB in their environment (either On-Premises or in the Cloud). This mode of deployment gives the user the most amount of flexibility in terms of performance, price, and security; however, it comes at a cost. Managing a PrestoDB deployment requires expertise and resources (personnel and infrastructure) to tweak, manage and monitor the deployment.
Large-scale DIY PrestoDB deployments do exist at enterprises that have mastered the skills of managing large-scale distributed systems such as Hadoop. These are typically enterprises maintaining their own Hadoop clusters, the FAANG companies (Facebook, Amazon, Apple, Netflix, Google), and tech-savvy companies such as Uber and Pinterest, to name a few.
The cost of managing an additional PrestoDB cluster may be incremental for a customer already managing large distributed systems; for customers starting from scratch, however, it can be a substantial increase in cost.
Managed Hadoop and Presto
Cloud providers such as AWS, Google, and Azure provide their own version of Managed Hadoop.
AWS provides EMR (Elastic MapReduce), Google provides Dataproc, and Azure provides HDInsight. These cloud providers support compatible versions of Presto that can be deployed on their version of Hadoop.
This option provides a “middle ground” where you are not responsible for managing and operating the infrastructure as you would traditionally do in a DIY model, but instead are only responsible for the configuration and tweaks required. Cloud provider-managed Hadoop deployments take over most responsibilities of cluster management, node recovery, and monitoring. Scale-out becomes easier at the push of a button, and costs can be further optimized by autoscaling using either on-demand or spot instances.
You still need to have the expertise to get the most out of your deployment by tweaking configurations, instance sizes, and properties.
Managed Presto Service
If you would rather not deal with what AWS calls the “undifferentiated heavy lifting”, a Managed Presto Cloud Service is the right solution for you.
Ahana Cloud provides a fully managed Presto cloud service with support for a wide range of native Presto connectors, IO caching, and configurations optimized for your workload. An expert service team can also work with you to help tune your queries and get the most out of your Presto deployment. Ahana's service is cloud-native and runs on Amazon's Elastic Kubernetes Service (EKS) to provide resiliency, performance, and scalability, and also helps reduce your operational costs.
A managed Presto Service such as Ahana gives you the visibility you need in terms of query performance, instance utilization, security, auditing, query plans as well as gives you the ability to manage your infrastructure with the click of a button to meet your business needs. A cluster is preconfigured with optimum defaults and you can tweak only what is necessary for your workload. You can choose to run a single cluster or multiple clusters. You can also scale up and down depending upon your workload needs.
Ahana is a premier member of the Linux Foundation's Presto Foundation and contributes many features back to the open-source Presto community, unlike Athena, Presto on EMR, Dataproc, and HDInsight.
Cloud Data Warehouse (Redshift, Snowflake)
Another alternative to Amazon Athena would be to use a data warehouse such as Snowflake or Redshift. This would require a paradigm shift from a decoupled, open lakehouse architecture to a more traditional design pattern built around a centralized storage and compute layer.
If you don’t have a lot of data and are mainly looking to run BI-type predictable workloads (rather than interactive analytics), storing all your data in a data warehouse such as Amazon Redshift or Snowflake would be a viable option. However, companies that work with larger amounts of data and need to run more experimental types of analysis will often find that data warehouses do not provide the required scale and cost-performance benefits and will gravitate towards a data lake.
In these cases, Athena or Presto can be used in tandem with a data warehouse and data engineers can choose where to run each workload on an ad-hoc basis. In other cases, the serverless option can replace the data warehouse completely.
Presto vs Athena: To Summarize
You have a wide variety of options regarding your use of PrestoDB.
If maximum control is what you need and you can justify the costs of managing a large team and deployment, then DIY implementation is right for you.
On the other hand, if you don’t have the resources to spin up a large team but still want the ability to tweak most tuning knobs, then a managed Hadoop with Presto service may be the way to go.
Using JMeter with Presto
Apache JMeter is an open source application written in Java that is designed for load testing. This article presents how to install it, and how to create and run a test plan for testing SQL workloads on Presto clusters.
You will need Java on the system you are installing JMeter on. If you do not have Java installed on your system, see How do I install Java?
You will need a Presto cluster to configure JMeter to connect to and run the test plan on. You can create a Presto cluster for free in Ahana Cloud Community Edition.
Installing JMeter
To install JMeter, start by downloading the latest JMeter build and unzipping the downloaded file into a new directory. For this article, the new directory's name is jmeter.
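For example, on Linux or macOS the download and unzip steps look roughly like this (the version placeholder should be replaced with the latest binary release listed on the Apache JMeter download page):

curl -O https://downloads.apache.org/jmeter/binaries/apache-jmeter-<version>.zip
unzip apache-jmeter-<version>.zip
mv apache-jmeter-<version> jmeter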
💡 As of this writing, the Presto JDBC driver version to use with an Ahana Presto cluster is presto-jdbc-0.272.jar. To find the Presto version for a cluster in Ahana, open the Manage view of the Presto cluster and look for Version of Presto in the Information section. The version shown will be similar to 0.272-AHN-0.1. Use the first four numbers to choose the Presto JDBC driver to download.
Copy the downloaded Presto JDBC driver jar file into the jmeter directory’s lib folder.
To run JMeter, change to the jmeter directory and run the command
bin/jmeter
Create a Test Plan in JMeter
In JMeter, select the Templates icon to show the Templates window. In the dropdown of the Templates window, select JDBC Load Test, then select Create.
Enter the JDBC endpoint in Database URL. In Ahana, you can find and copy the JDBC endpoint in Connection Details of the Presto cluster.
You can include either or both of the catalog and schema names in the Database URL separated by slashes after the port number. For example:
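For instance, assuming a catalog named ahana_hive and a schema named default (both are placeholders; use your own names), the Database URL would look like this, with the host and port taken from the copied JDBC endpoint:

jdbc:presto://<host>:<port>/ahana_hive/default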
If not, you must specify the catalog and schema names in the SQL query in JDBC Request.
Enter com.facebook.presto.jdbc.PrestoDriver in JDBC Driver class.
Enter Username and Password of a Presto user attached to the Presto cluster.
In the navigation bar on the left, expand Thread Group, select JDBC Request, and enter the SQL query in Query.
💡 Do not include a semicolon at the end of the SQL query that you enter, or the test plan run will fail.
Set how many database requests run at once in Number of Threads (users).
In Ramp-up period (seconds), enter the time that JMeter should take to start all of the requested threads.
Loop Count controls how many times the thread steps are executed.
For example, if Number of Threads (users) = 10, Ramp-up period (seconds) = 100, and Loop Count = 1, JMeter creates a new thread every 10 seconds and the SQL query runs once in each thread.
You can add a report of the performance metrics for the requests to the test plan. To do so, right-click Test Plan in the left navigation bar, then select Add → Listener → Summary Report.
Select the Save icon in JMeter and enter a name for the test plan.
Run the Test Plan and View Results
To run the test plan, select the Start icon.
In the left navigation bar, select View Results Tree or Summary Report to view the output of the test plan.
Run JMeter in Command Line Mode
For best results from load testing, it is recommended to run JMeter without the GUI. After you have created and configured a test plan using the GUI, quit JMeter, then run it from the command line.
For example, to run JMeter with a test plan named testplan, and create a report in a new directory named report, run the following command:
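The command below assumes the plan was saved as testplan.jmx in the jmeter directory: -n runs JMeter without the GUI, -t points to the test plan, -l names the results file, and -e/-o generate the HTML report into the new report directory.

bin/jmeter -n -t testplan.jmx -l results.jtl -e -o report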
To learn how to run Presto on Kubernetes, let’s cover the basics first. Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.
Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, and object stores – as well as search and indexing systems such as Elasticsearch and Druid.
The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.
How is Presto typically deployed?
Presto deployments can be found in various flavors today. These include:
Presto on Hadoop: This involves Presto running as a part of a Hadoop cluster, either as a part of open source or commercial Hadoop deployments (e.g. Cloudera) or as a part of Managed Hadoop (e.g. EMR, Dataproc)
DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
Serverless Presto (Athena): AWS’ Serverless Presto Service
Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)
Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.
All the scripts, configuration files, etc. can be found in these public github repositories:
You will need to clone the repositories locally to use the configuration files.
git clone <repository url>
What is Kubernetes (K8s)?
Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery.
In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.
Why Presto on Kubernetes?
Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes’ ability to ease operational management of the application significantly simplifies the Presto deployment – resiliency, configuration management, ease of scaling in-and-out come out of the box with K8s.
A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.
Getting Started with Presto on Kubernetes
Local Docker Setup
To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single-node deployment running locally on your machine. This will familiarize you with the basic configuration parameters of the Docker container and make troubleshooting much easier.
Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.
Kubernetes / EKS Cluster
To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes e.g. reducing the resource requirements on the containers.
If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.
Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.
After you have your EKS cluster created (in my case it's 'presto-cluster'), you should go in and add a node group that has sufficient capacity for the Presto Docker containers to run on. I plan on using r5.2xlarge nodes and set up a node group of 4 nodes. (You can tweak your Presto Docker container settings accordingly and use smaller nodes if required.)
Figure 1: Creating a new nodegroup
Figure 2: Setting the instance type and node count
Once your node group shows as active, you are ready to move on to the next step.
Figure 3: Make sure your node group is successfully created and is active
Tinkering with the Docker containers locally
Let's first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Dockerfile, the scripts, and the supported environment variables, the repository can be found here.
The details of the specific configuration parameters used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by providing the values via the --env option for docker, or by using name-value pairs in the Kubernetes YAML file as we will see later.
You need the following:
A user and their Access Key and Secret Access Key for Glue and S3 (You can use the same or different user):
arn:aws:iam::<your account id>:user/<your user>
A role which the user above can assume to access Glue and S3:
arn:aws:iam::<your account id>:role/<your role>
Figure 4: Assume role privileges
Figure 5: Trust relationships
Access to the latest Docker image for this tutorial: asifkazi/presto-on-docker:latest
Warning: The permissions provided above are pretty lax, giving the user broad privileges, not just to assume the role but also over what operations the user can perform on S3 and Glue. DO NOT use these permissions as-is in production. It's highly recommended to tighten the privileges using the principle of least privilege (only provide the minimal access required).
From within the docker container we will verify that everything is working correctly:
Run the following command:
presto
From within the presto cli run the following:
show schemas from mysql
The command should list the MySQL databases.
From within the presto cli run the following:
show schemas from hive
The command should list the databases from Glue. If you are using Glue for the first time, you might only see the information_schema and default databases.
We have validated that the Docker container itself is working fine as a single-node cluster (worker and coordinator on the same node). We will now move on to getting this environment working in Kubernetes. But first, let's clean up.
Run the following command to stop and cleanup your docker instances locally.
docker stop mysql presto;docker rm mysql presto;
How to get started running Presto on Kubernetes
To get Presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to Kubernetes-specific properties, we will provide all the Docker environment properties via name-value pairs.
Create a namespace for the presto cluster
kubectl create namespace presto
Override the env settings in the presto.yaml file for both the coordinator and worker sections
Apply the yaml file to the Kubernetes cluster
kubectl apply -f presto.yaml --namespace presto
Let’s also start a mysql instance. We will first start by creating a persistent volume and claim.
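The exact manifests are in the repository; as a sketch, the persistent volume claim part looks roughly like this (the claim name, namespace, and size here are illustrative, not the repository's exact values):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pv-claim
  namespace: presto
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Apply it, along with the MySQL deployment manifest from the repository, with kubectl apply -f in the presto namespace.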
Check the status of the cluster and make sure there are no errored or failing pods:
kubectl get pods -n presto
Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You are going to need the pod name for the coordinator from the command above.
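Assuming the coordinator pod is named something like presto-coordinator-<id> (substitute the actual name from the kubectl get pods output), the command to get a shell looks like this:

kubectl exec -it presto-coordinator-<id> -n presto -- bash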
Note: the space between the -- and bash is required
Querying seems to be working but is the Kubernetes deployment a multi-node cluster? Let’s check:
select node,vmname,vmversion from jmx.current."java.lang:type=runtime";
Let's see what happens if we destroy one of the pods (to simulate a failure):
kubectl delete pod presto-worker-k9xw8 -n presto
What does the current deployment look like?
What? The pod was replaced by a new one presto-worker-tnbsb!
Now we’ll modify the number of replicas for the workers in the presto.yaml
Set replicas to 4
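In the worker section of presto.yaml this is the standard Kubernetes replica count; schematically (the exact nesting depends on the repository's manifest):

spec:
  replicas: 4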
Apply the changes to the cluster
kubectl apply -f presto.yaml --namespace presto
Check the number of running pods for the workers
kubectl get pods -n presto
Wow, we have a fully functional Presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing the availability and resiliency.
Summary
In this tutorial we set up a single-node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of the Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs as well as demonstrate resiliency by forcefully killing off a pod.
Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.
Announcing the Cube integration with Ahana: Querying multiple data sources with managed Presto and Cube
See how Ahana and Cube work together to help you set up a Presto cluster and build a single source of truth for metrics without spending days reading cryptic docs
Ahana provides managed Presto clusters running in your AWS account.
Presto is an open-source distributed SQL query engine, originally developed at Facebook, now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.
Presto implements the data federation feature: you can process data from multiple sources as if they were stored in a single database. Because of that, you don’t need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single-point-of-access for multiple databases (or file systems) requires Ops skills and an additional effort.
However, no data engineer wants to do the Ops work. Using Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana hides the burden of infrastructure management and allows you to focus on processing your data.
What is Cube?
Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including Presto, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.
This article will demonstrate the caching functionality, access control, and flexibility of the data retrieval API.
Integration
Cube's battle-tested Presto driver provides out-of-the-box connectivity to Ahana.
You just need to provide the credentials: Presto host name and port, user name and password, and the Presto catalog and schema. You'll also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.
Check the docs to learn more about connecting Cube to Ahana.
Example: Parsing logs from multiple data sources with Ahana and Cube
Let’s build a real-world data application with Ahana and Cube.
We will use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with the data retrieved from a PostgreSQL database.
Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you would like to demonstrate the improvements to the clients.
Because of that, you do a canary deployment of the versions and gather the predictions from the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. Additionally, you also track the actual production values provided by your clients.
You need all of that to prepare personalized dashboards showing the results of your hard work.
Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.
You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.
In the end, you want a dashboard looking like this:
The final result: two dashboards showing the number of errors made by two variants of the ML model
How to configure Ahana?
Allowing Ahana to access your AWS account
First, let’s login to Ahana, and connect it to your AWS account. We must create an IAM role allowing Ahana to access our AWS account.
On the setup page, click the “Open CloudFormation” button. After clicking the button, we get redirected to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.
When the IAM role is configured, click the stack’s Outputs tab and copy the AhanaCloudProvisioningRole key value.
We have to paste it into the Role ARN field on the Ahana setup page and click the “Complete Setup” button.
Creating an Ahana cluster
After configuring AWS access, we have to start a new Ahana cluster.
In the Ahana dashboard, click the “Create new cluster” button.
In the setup window, we can configure the type of the AWS EC2 instances used by the cluster, scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, look at the “Create new cluster” section of the Ahana documentation.
Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the "Create cluster" button. Ahana needs around 20-30 minutes to set up a new cluster.
Retrieving data from S3 and PostgreSQL
After deploying a Presto cluster, we have to connect our data sources to the cluster because, in this example, the Sagemaker Endpoint logs are stored in S3 and PostgreSQL.
Adding a PostgreSQL database to Ahana
In the Ahana dashboard, click the “Add new data source” button. We will see a page showing all supported data sources. Let’s click the “Amazon RDS for PostgreSQL” option.
In the setup form displayed below, we have to provide the database configuration and click the “Add data source” button.
Adding an S3 bucket to Ahana
AWS Sagemaker Endpoint stores their logs in an S3 bucket as JSON files. To access those files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.
We have to login to the AWS console, open the AWS Glue page and add a new database to the data catalog (or use an existing one).
Now, let’s add a new table. We won’t configure it manually. Instead, let’s create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the “Crawlers” link and click the “Add crawler” button.
After typing the crawler's name and clicking the "Next" button, we will see the Source Type page. On this page, we have to choose the "Data stores" option and "Crawl all folders" (in our case, "Crawl new folders only" would work too).
Here we specify where the crawler should look for new data
On the “Data store” page, we pick the S3 data store, select the S3 connection (or click the “Add connection” button if we don’t have an S3 connection configured yet), and specify the S3 path.
Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use those parts of the key as table partitions.
Because of that, if our Sagemaker logs have an S3 key: s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!
Let’s click the “Next” button. In the subsequent window, we choose “No” when asked whether we want to configure another data source. Glue setup will ask about the name of the crawler’s IAM role. We can create a new one:
Next, we configure the crawler’s schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:
In the output configuration, we need to customize the settings.
First, let’s select the Glue database where the new tables get stored. After that, we modify the “Configuration options.”
We pick the “Add new columns only” because we will make manual changes in the table definition, and we don’t want the crawler to overwrite them. Also, we want to add new partitions to the table, so we check the “Update all new and existing partitions with metadata from the table.” box.
Let’s click “Next.” We can check the configuration one more time in the review window and click the “Finish” button.
Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.
In the table view, we click the “Edit table” button and change the “Serde serialization lib” to “org.apache.hive.hcatalog.data.JsonSerDe” because the AWS JSON serialization library isn’t available in the Ahana Presto cluster.
We should also click the “Edit schema” button and change the default partition names to values shown in the screenshot below:
After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.
Configuring data sources in the Presto cluster
Go back to the Ahana dashboard and click the “Add data source” button. Select the “AWS Glue Data Catalog for Amazon S3” option in the setup form.
Let’s select our AWS region and put the AWS account id in the “Glue Data Catalog ID” field. After that, we click the “Open CloudFormation” button and apply the template. We will have to wait until CloudFormation creates the IAM role.
When the role is ready, we copy the role ARN from the Outputs tab and paste it into the “Glue/S3 Role ARN” field:
The “Outputs” tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana
On the Ahana setup page, we click the “Add data source” button.
Adding data sources to an existing cluster
Finally, we can add both data sources to our Ahana cluster.
We have to open the Ahana “Clusters” page, click the “Manage” button, and scroll down to the “Data Sources‚” section. In this section, we click the “Manage data sources” button.
We will see another setup page where we check the boxes next to the data sources we want to configure and click the “Modify cluster” button. We will need to confirm that we want to restart the cluster to make the changes.
Writing the Presto queries
The actual structure of the input and output from an AWS Sagemaker Endpoint depends on us. We can send any JSON request and return a custom JSON object.
Let’s assume that our endpoint receives a request containing the input data for the machine learning model and a correlation id. We will need those ids to join the model predictions with the actual data.
AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:
with sagemaker as (
select
model_name,
variant_name,
cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
from s3.sagemaker_logs.logs
)
, actual as (
select correlation_id, actual_value
from postgresql.public.actual_values
)
After that, we join both data sources and calculate the absolute error value:
, logs as (
select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
from sagemaker
left outer join actual
on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
Now, we need to calculate the percentiles using the `approx_percentile` function. Note that we group the percentiles by model name and model variant. Because of that, Presto will produce only a single row per every model-variant pair. That’ll be important when we write the second part of this query.
percentiles as (
select approx_percentile(abs_err, 0.1) as perc_10,
approx_percentile(abs_err, 0.2) as perc_20,
approx_percentile(abs_err, 0.3) as perc_30,
approx_percentile(abs_err, 0.4) as perc_40,
approx_percentile(abs_err, 0.5) as perc_50,
approx_percentile(abs_err, 0.6) as perc_60,
approx_percentile(abs_err, 0.7) as perc_70,
approx_percentile(abs_err, 0.8) as perc_80,
approx_percentile(abs_err, 0.9) as perc_90,
approx_percentile(abs_err, 1.0) as perc_100,
model_name,
model_variant
from errors
group by model_name, model_variant
)
In the final part of the query, we will use the filter expression to count the number of values within buckets. Additionally, we return the bucket boundaries. We need to use an aggregate function max (or any other aggregate function) because of the group by clause. That won’t affect the result because we returned a single row per every model-variant pair in the previous query.
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant
How to configure Cube?
In our application, we want to display the distribution of absolute prediction errors.
We will have a chart showing the difference between the actual value and the model’s prediction. Our chart will split the absolute errors into buckets (percentiles) and display the number of errors within every bucket.
If the new variant of the model performs better than the existing model, we should see fewer large errors in the charts. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the “0” label.
At the beginning of the article, we looked at an example chart that shows no significant difference between both model variants:
If variant B were better than variant A, its chart could look like this (note the axis values in both pictures):
Creating a Cube deployment
Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready to use Cube cluster. However, if you prefer self-hosting, then follow this tutorial.
First, please create a new Cube Cloud deployment. Then, open the “Deployments” page and click the “Create deployment” button.
We choose the Presto cluster:
Finally, we fill out the connection parameters and click the "Apply" button. Remember to enable the SSL connection!
Defining the data model in Cube
We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve query results.
Let’s open the Schema view in Cube and add a new file.
In the next window, type the file name errorpercentiles.js and click “Create file.”
In the following paragraphs, we will explain parts of the configuration and show you code fragments to copy-paste. You don’t have to do that in such small steps!
Below, you see the entire content of the file. Later, we explain the configuration parameters.
const measureNames = [
'perc_10', 'perc_10_value',
'perc_20', 'perc_20_value',
'perc_30', 'perc_30_value',
'perc_40', 'perc_40_value',
'perc_50', 'perc_50_value',
'perc_60', 'perc_60_value',
'perc_70', 'perc_70_value',
'perc_80', 'perc_80_value',
'perc_90', 'perc_90_value',
'perc_100', 'perc_100_value',
];
const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});
cube('errorpercentiles', {
sql: `with sagemaker as (
select
model_name,
variant_name,
cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
from s3.sagemaker_logs.logs
)
, actual as (
select correlation_id, actual_value
from postgresql.public.actual_values
)
, logs as (
select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
from sagemaker
left outer join actual
on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
select approx_percentile(abs_err, 0.1) as perc_10,
approx_percentile(abs_err, 0.2) as perc_20,
approx_percentile(abs_err, 0.3) as perc_30,
approx_percentile(abs_err, 0.4) as perc_40,
approx_percentile(abs_err, 0.5) as perc_50,
approx_percentile(abs_err, 0.6) as perc_60,
approx_percentile(abs_err, 0.7) as perc_70,
approx_percentile(abs_err, 0.8) as perc_80,
approx_percentile(abs_err, 0.9) as perc_90,
approx_percentile(abs_err, 1.0) as perc_100,
model_name,
model_variant
from errors
group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant`,
preAggregations: {
// Pre-Aggregations definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},
joins: {
},
measures: measures,
dimensions: {
modelVariant: {
sql: `model_variant`,
type: 'string'
},
modelName: {
sql: `model_name`,
type: 'string'
},
}
});
In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.
We will group and filter the values by the model and variant names, so we put those columns in the dimensions section of the cube configuration. The rest of the columns are going to be our measures. We could write them out one by one.
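As a sketch, the verbose version would look like this (only the first two of the twenty measures are shown; each follows the same pattern):

measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  // ...and so on for all twenty columns
},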
The notation we have shown you has lots of repetition and is quite verbose. We can shorten the measurements defined in the code by using JavaScript to generate them.
We had to add the following code before using the cube function!
First, we have to create an array of column names:
Finally, we can replace the measure definitions with:
measures: measures
After changing the file content, click the “Save All” button.
And click the Continue button in the popup window.
In the Playground view, we can test our query by retrieving the chart data as a table (or one of the built-in charts):
Configuring access control in Cube
In the Schema view, open the cube.js file.
We will use the queryRewrite configuration option to allow or disallow access to data.
First, we will reject all API calls without the models field in the securityContext. We will put the identifiers of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token variables.
For example, we can send a JWT token with the following payload. Of course, in the application sending queries to Cube, we must check the user’s access right and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don’t forget about them.
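Here is a minimal sketch of such a payload; the model identifiers are placeholders for whatever identifiers your application stores:
{
  "models": ["model_a", "model_b"]
}
The models array is exactly what the queryRewrite check described below looks for in the security context.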
After rejecting unauthorized access, we add a filter to all queries.
We can distinguish between the datasets accessed by the user by looking at the data specified in the query. We need to do it because we must filter by the modelName property of the correct table.
In our queryRewrite configuration in the cube.js file, we use the query.filters.push function to add a modelName IN (model_1, model_2, ...) filter to the generated query:
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'percentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};
Configuring caching in Cube
By default, Cube caches all Presto queries for 2 minutes. Even though SageMaker Endpoints store logs in S3 in near real time, we aren't interested in refreshing the data that often. SageMaker Endpoints store the logs in JSON files, so retrieving the metrics requires a full scan of all files in the S3 bucket.
When we gather logs over a long time, the query may take some time. Below, we will show you how to configure the caching in Cube. We recommend doing it when the end-user application needs over one second to load the data.
For the sake of the example, we will retrieve the value only twice a day.
Preparing data sources for caching
First, we must allow Presto to store data in both PostgreSQL and S3. It’s required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading them into Cube.
In PostgreSQL, we grant permissions to the user account used by Presto to access the database:
GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
If we haven’t modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn’t contain the target S3 location yet, so all requests will fail.
We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.
If you want to use a different database name, follow the instructions in our documentation.
Caching configuration in Cube
Let’s open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:
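The full configuration is not reproduced here, but a minimal sketch looks like the following, replacing the empty preAggregations placeholder shown earlier. The pre-aggregation name main is arbitrary, the 12-hour refreshKey matches the twice-a-day refresh mentioned above, and you should list every measure and dimension your dashboard queries:
preAggregations: {
  main: {
    measures: [
      CUBE.perc_10, CUBE.perc_10_value,
      CUBE.perc_20, CUBE.perc_20_value,
      // ...and so on through perc_100 and perc_100_value
    ],
    dimensions: [CUBE.modelName, CUBE.modelVariant],
    refreshKey: {
      every: `12 hour`
    }
  }
},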
After testing the development version, we can also deploy the changes to production using the “Commit & Push” button. When we click it, we will be asked to type the commit message:
When we commit the changes, the deployment of a new version of the endpoint will start. A few minutes later, we can start sending queries to the endpoint.
We can also check the pre-aggregations window to verify whether Cube successfully created the cached data.
Now, we can move to the Playground tab and run our query. We should see the “Query was accelerated with pre-aggregation” message if Cube used the cached values to handle the request.
Building the front-end application
Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.
Our dashboard consists of two parts: the website and the back-end service. In the web part, we will have only the code required to display the charts. In the back-end, we will handle authentication and authorization. The backend service will also send requests to the Cube REST API.
Getting the Cube API key and the API URL
Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud’s web UI and click the “Env vars” tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to the CUBEJS_API_SECRET and copy the value.
We also need the URL of the Cube endpoint. To get this value, click the “Copy API URL” link in the top right corner of the screen.
Back end for front end
Now, we can write the back-end code.
First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect those model names in an array stored in the allowedModels variable.
After getting the user’s credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.
const jwt = require('jsonwebtoken');

const CUBE_URL = ''; // Put the API URL retrieved in the previous step here

function create_cube_token() {
  const CUBE_API_SECRET = 'your_token'; // Don’t store the secret in the code!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system
  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );
  return cubejsToken;
}
We will need two endpoints in our back-end service: the endpoint returning the chart data and the endpoint retrieving the names of models and variants we can access.
We create a new express application running in the node server and configure the /models endpoint:
const request = require('request');
const express = require('express');
const bodyParser = require('body-parser');

const port = 5000;
const app = express();

app.use(bodyParser.json());
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`);
});
In the getAvailableModels function, we query the Cube Cloud API to get the model names and variants. It will return only the models we are allowed to see because we have configured the Cube security context:
Our function returns a list of objects containing the modelName and modelVariant fields.
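A sketch of what getAvailableModels can look like, using the request library imported above and Cube’s REST API /v1/load endpoint; the cube and dimension names come from the schema defined earlier, and error handling is trimmed for brevity:

function getAvailableModels(req, res) {
  // Ask Cube for the distinct model/variant combinations visible to this token
  const query = {
    dimensions: [
      'errorpercentiles.modelName',
      'errorpercentiles.modelVariant'
    ]
  };
  request.get({
    url: CUBE_URL + '/v1/load?query=' + encodeURIComponent(JSON.stringify(query)),
    headers: { Authorization: create_cube_token() },
    json: true
  }, (err, response, body) => {
    if (err) {
      return res.status(500).send(err);
    }
    // Rows come back keyed by "cube.member" names
    res.send(body.data.map(row => ({
      modelName: row['errorpercentiles.modelName'],
      modelVariant: row['errorpercentiles.modelVariant']
    })));
  });
}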
Let’s retrieve the percentiles and percentile buckets. To simplify the example, we will show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.
The query specifies all measures we want to retrieve and sets the filter to get data belonging to a single model’s variant. We could retrieve all data at once, but we do it one by one for every variant.
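As a sketch, the Cube query for one variant can look like this; the measure and dimension names match the errorpercentiles cube, and modelName and modelVariant hold values returned by the /models endpoint:

const query = {
  measures: [
    'errorpercentiles.perc_10', 'errorpercentiles.perc_10_value',
    'errorpercentiles.perc_20', 'errorpercentiles.perc_20_value',
    'errorpercentiles.perc_30', 'errorpercentiles.perc_30_value',
    'errorpercentiles.perc_40', 'errorpercentiles.perc_40_value',
    'errorpercentiles.perc_50', 'errorpercentiles.perc_50_value',
    'errorpercentiles.perc_60', 'errorpercentiles.perc_60_value',
    'errorpercentiles.perc_70', 'errorpercentiles.perc_70_value',
    'errorpercentiles.perc_80', 'errorpercentiles.perc_80_value',
    'errorpercentiles.perc_90', 'errorpercentiles.perc_90_value',
    'errorpercentiles.perc_100', 'errorpercentiles.perc_100_value'
  ],
  filters: [
    { member: 'errorpercentiles.modelName', operator: 'equals', values: [modelName] },
    { member: 'errorpercentiles.modelVariant', operator: 'equals', values: [modelVariant] }
  ]
};

The returned row then contains both the bucket counts (perc_10 through perc_100) and the bucket boundaries (the perc_*_value fields) that the chart needs.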
In the last step, we build the dashboard website using Vue.js.
If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.
We define the main Vue component encapsulating the entire website content. In the script section, we will download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for all of them.
We put the charts in the Suspense component to allow asynchronous loading.
To keep the example short, we will skip the CSS style part.
The OwnerName component displays our client’s name. We will skip its code as it’s irrelevant in our example.
In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:
import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);
We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain those properties:
If we have done everything correctly, we should see a dashboard page with error distributions.
Wrapping up
Thanks for following this tutorial.
We encourage you to spend some time reading the Cube and Ahana documentation.
Please don’t hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube’s Slack community, and subscribe to the Ahana newsletter.
AWS Athena Limitations
Amazon Athena is a useful query tool – but sometimes you need more control over price, performance, and scale. Ahana runs on the same powerful underlying technology but gives you back control – so you can scale your data lake analytics without exploding your cloud bill. Learn more or request a demo today
Welcome to our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB. We’ll also discuss some of the reasons why you’d choose to deploy PrestoDB on yourself, rather than using the AWS Athena service. We hope you find this series helpful.
AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. Its serverless architecture is a benefit, but cost is one of the drawbacks: users pay per query, currently priced at $5 per terabyte of data scanned. Athena also imposes technical limits, including query limits, concurrent query limits, and partition limits, which can slow workloads down and drive up operational costs. In addition, AWS Athena is built on an older version of PrestoDB and only supports a subset of PrestoDB features.
An overview on AWS Athena limits
AWS Athena query limits can cause problems, and many data engineering teams have spent hours trying to diagnose them. Most of the limitations associated with Athena are fixed, but luckily some are soft quotas that you can ask AWS to increase. One big issue is Athena’s restrictions on queries: by default, Athena users can only submit one query at a time and can only run up to five queries simultaneously per account.
AWS Athena query limits
AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like DELETE and INSERT) have the following limits:
1. Athena DDL max query limit: 20 DDL active queries.
2. Athena DDL query timeout limit: The Athena DDL query timeout is 600 minutes.
3. Athena DML query limit: Athena only allows you to have 25 DML queries (running and queued queries) in the US East and 20 DML queries in all other Regions by default.
4. Athena DML query timeout limit: The Athena DML query timeout limit is 30 minutes.
5. Athena query string length limit: The Athena query string hard limit is 262,144 bytes.
Athena’s users can use AWS Glue, a data catalog and ETL service. Athena’s partition limit is 20,000 per table and Glue’s limit is 1,000,000 partitions per table.
A Create Table As (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation you must manually chop up your data by running a series of INSERT INTOs that insert up to 100 partitions each.
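A rough sketch of that workaround is shown below; the table, columns, and date ranges are hypothetical, and each statement covers a slice of data that produces at most 100 partitions:

CREATE TABLE sales_partitioned
WITH (format = 'PARQUET', partitioned_by = ARRAY['sale_date'])
AS SELECT item_id, price, sale_date
FROM raw_sales
WHERE sale_date BETWEEN DATE '2021-01-01' AND DATE '2021-04-10';

INSERT INTO sales_partitioned
SELECT item_id, price, sale_date
FROM raw_sales
WHERE sale_date BETWEEN DATE '2021-04-11' AND DATE '2021-07-19';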
Athena database limits
AWS Athena also has the following S3 bucket limitations:
1. Amazon S3 bucket limit is 100* buckets per account by default – you can request to increase it up to 1,000 S3 buckets per account.
2. Athena restricts each account to 100* databases, and databases cannot include over 100* tables.
*Note, recently Athena has increased this to 10K databases per account and 200K tables per database.
Summary of Athena DB limits:
Amazon S3 bucket limit: 1,000 buckets per account
Database limit: 10,000 databases per account
Tables per database: 200,000
AWS Athena open-source alternative
Deploying your own PrestoDB cluster
An Amazon Athena alternative is deploying your own PrestoDB cluster. Amazon Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).
Deploying and managing PrestoDB on your own means you won’t have AWS Athena limitations such as the concurrent query limits, database limits, table limits, partition limits, and so on. Plus you’ll get the very latest version of Presto. PrestoDB is an open source project hosted by The Linux Foundation’s Presto Foundation. It has a transparent, open, and neutral community.
If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.
Ahana Cloud for Presto: A fully managed service
Ahana Cloud for Presto is a fully managed Presto cloud service, without the limitations of AWS Athena.
With Ahana you can query and analyze AWS data lakes stored in Amazon S3, along with many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI: anyone at any knowledge level can use it, there is zero configuration effort, and there are no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud.
Discover the 4 most popular choices to replace Amazon Athena.
Building an Open Data Lakehouse with Presto, Hudi and AWS S3
The Open Data Lakehouse – a quick intro
Understanding the necessity of building a Data Lakehouse is critical in today’s data landscape. If you’re looking to get started with constructing a data lakehouse analytics stack, book time with an engineer to expedite the development process.
Data warehouses have been considered a standard to perform analytics on structured data but cannot handle unstructured data such as text, images, audio, video and other formats. Additionally, machine learning and AI are becoming common in every aspect of business and they need access to vast amounts of data outside of data warehouses.
The cloud transformation has triggered the disaggregation of compute and storage, which brings cost benefits and the flexibility to store data coming from many different sources. All of this has led to a new data platform architecture called the Open Data Lakehouse. It solves the challenges of the traditional cloud data warehouse through its use of open source and open format technologies such as Presto and Hudi. In this blog you will learn more about the open data lake analytics stack using Presto, Hudi, and AWS S3.
What is an Open Data Lakehouse
The Open Data Lakehouse is based on the concept of bringing your warehouse workloads to the data lake. You can run analytics on technology and tools that do not require any vendor lock-in including licensing, data formats, interfaces and infrastructure.
Four key elements include:
Open source – The technologies on the stack we will be exploring for Open Data Lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community.
Open formats – The stack doesn’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet and others.
Open interfaces – The interfaces are industry standard ANSI SQL compatible and standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, industry standard language clauses continue to be added in and expanded on.
Open cloud – The stack is cloud agnostic; because compute is not tied to storage, it aligns natively with containers and can be run on any cloud.
Why Open Data Lakehouses?
Open data lakehouses allow you to consolidate structured and unstructured data in a central repository, the open data lake, at lower cost. They remove the complexity of running ETL, delivering high performance while reducing the cost and time needed to run analytics.
Bringing compute to your data (decoupling of compute and storage)
Flexibility at the governance/transaction layer
Flexibility and low cost to store structured and semi/unstructured data
Flexibility at every layer – pick and choose which technology works best for your workloads/use case
Open Data Lakehouse architecture
Now let’s dive into the stack itself and each of the layers. We’ll discuss what problems each layer solves for.
BI/Application tools – Data Visualization, Data Science tools
Plug in your BI/analytical application tool of choice. The Open Data Lake Analytics stack supports the use of JDBC/ODBC drivers, so you can connect Tableau, Looker, Preset, Jupyter notebooks, etc. based on your use case and workload.
Presto – SQL Query Engine for the Data Lake
Presto is a parallel distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores.
Presto is built for high performance interactive querying with in-memory execution
Key characteristics include:
High scalability from 1 to 1000s of workers
Flexibility to support a wide range of SQL use cases
Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
Federation of data sources particularly data lakes via Presto connectors
Seamless integration with existing SQL systems with ANSI SQL standard
A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into the data storage tier like S3.
Apache Hudi – Streaming Transactions in the Open Data Lake
One of the big drawbacks in traditional data warehouses is keeping the data updated. It requires building data mart/cubes then doing constant ETL from source to destination mart, resulting in additional time, cost and duplication of data. Similarly, data in the data lake needs to be updated and consistent without that operational overhead.
A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves for the following:
Ingesting incremental data
Change data capture, including both inserts and deletes
Incremental data processing
ACID transactions
Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals, is an open-source based transaction layer with storage abstraction for analytics developed by Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses open file formats Parquet and Avro for data storage and internal table formats known as Copy-On-Write and Merge-On-Read.
It has built-in integration with Presto so you can query “hudi datasets” stored on the open file formats.
Hudi Data Management
Hudi has a table format based on a directory structure; a table has partitions, which are folders containing the data files for that partition. It also has indexing capabilities to support fast upserts. Hudi has two table types that define how data is indexed and laid out, which in turn defines how the underlying data is exposed to queries.
(Image source: Apache Hudi)
Copy-On-Write (COW): Data is stored in Parquet file format (columnar storage), and each new update creates a new version of files during a write. Updating an existing set of rows will result in a rewrite of the entire parquet files for the rows being updated.
Merge-On-Read (MOR): Data is stored in a combination of Parquet file format (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which will produce new versions of the columnar files.
Based on the two table types Hudi provides three logical views for querying data from the Data Lake.
Read-optimized – Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
Incremental – Queries see new data written to the table after a commit/compaction. This helps to build incremental data pipelines and analytics on top of them.
Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline
AWS S3 – The Data Lake
The data lake is the central location for storing data from disparate sources such as structured, semi-structured and unstructured data and in open formats on object storage such as AWS S3.
Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.
Getting Started: How to run Open data lake analytics workloads using Presto to query Apache Hudi datasets on S3
Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can actually use Presto to query your Hudi datasets on S3.
Data can be ingested into the data lake from different sources such as Kafka and other databases. By introducing Hudi into the data pipeline, the needed Hudi tables are created or updated, and the data is stored in either Parquet or Avro format in the S3 data lake, based on the table type. Later, BI tools and applications can query the data using Presto, and the results reflect updates as the data changes.
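For example, once a Hudi table is synced to the Hive metastore, querying it from Presto looks like querying any other table; the catalog, schema, and table names below are hypothetical:

SELECT symbol, max(ts) AS latest_ts
FROM ahana_hive.default.stock_ticks_cow
GROUP BY symbol;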
Conclusion:
The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance and cost.
The technologies that make up that stack are critical. Presto, the de facto SQL query engine for the data lake, combined with the transactional support and change data capture capabilities of Hudi, makes for a strong open source, open format solution for data lake analytics. A missing component, however, is data lake governance, which allows queries on S3 to run more securely. AWS has recently introduced Lake Formation, a data governance solution for the data lake, and Ahana, a managed service for Presto, seamlessly integrates Presto with AWS Lake Formation to run interactive queries on your AWS S3 data lakes with fine-grained access to data.
AWS Lake Formation helps users build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.
PrestoDB on AWS
What is PrestoDB on AWS?
Tip: If you are looking to better understand PrestoDB on AWS, check out the free, downloadable ebook, Learning and Operating Presto. This ebook breaks down what Presto is, how it started, and the best use cases.
To tackle this common question, what is PrestoDB on AWS, let’s first define Presto. PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. Presto was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. It was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Presto enables self-service ad-hoc analytics for its users on large amounts of data. With Presto, you can query data where it lives, including Hive, Amazon S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
AWS and Presto are a powerful combination. If you want to run PrestoDB on AWS, it’s easy to spin up a managed Presto cluster, either through the Amazon Management Console, the AWS CLI, or the Amazon EMR API. Running the Presto CLI against an EMR cluster is also not too difficult.
You can also give Ahana Cloud a try. Ahana is a managed service for Presto that takes care of the devops for you and provides everything you need to build your SQL Data Lakehouse using Presto.
Running Presto on AWS gives you the flexibility, scalability, performance, and cost-effective features of the cloud while allowing you to take advantage of Presto’s distributed query engine.
How does PrestoDB on AWS Work?
This is another very common question. The quickest answer is that PrestoDB is the compute engine on top of the data storage of your SQL Data Lakehouse. In this case, the storage is AWS S3. See the image below for an overview.
There are some AWS services that work with PrestoDB on AWS, like Amazon EMR and Amazon Athena; they are the best-known Amazon services for deploying Presto in the cloud. They are managed services that handle the integration, testing, setup, configuration, and cluster tuning for you. Both Athena and EMR are widely used, but each comes with challenges, particularly around price-performance and cost.
There are some differences when it comes to EMR Presto vs Athena. AWS EMR enables you to provision as many compute instances as you want, and within minutes. Amazon Athena lets you deploy Presto using the AWS Serverless platform, with no servers, virtual machines, or clusters to setup, manage, or tune.
Many Amazon Athena users run into issues, however, when it comes to scale and concurrent queries. Comparing Amazon Athena with Presto is common, and many users weigh the managed Athena service against running PrestoDB themselves. Learn more about those challenges and why they’re moving to Ahana Cloud, SaaS for PrestoDB on AWS.
To get started with Presto for your SQL Data Lakehouse on AWS quickly, check out the services from Ahana Cloud. Ahana has two versions of their solution: a Full-Edition and a Free-Forever Community Edition. Each option has components of the SQL Lakehouse included, as well as support from Ahana. Explore Ahana’s managed service for PrestoDB.
Presto was originally designed to run interactive queries against data warehouses, but now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads.
Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project. The biggest difference between the two is that Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers.
In this blog, we discuss AWS Athena vs Presto and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service, like AWS pricing.
Presto has evolved into a unified engine for SQL queries on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on AWS Redshift.
Presto’s Redshift connector allows conducting SQL queries on the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters.
Further, the JDBC URL from the cluster is required to set up the Redshift connector with Presto.
You can skip this section if you want to use your existing Redshift cluster. Just make sure your Redshift cluster is accessible from Presto, because AWS services are secure by default. Even if you have created your Amazon Redshift cluster in a public VPC, the security group assigned to the target Redshift cluster can prevent inbound connections to the database cluster. In simple words, the security group settings of the Redshift database act as a firewall and prevent inbound database connections over port 5439. Find the assigned security group and check its inbound rules.
Step 3: Configure Presto Catalog for Amazon Redshift Connector
At Ahana we have simplified this experience and you can do this step in a few minutes as explained in these instructions.
Essentially, to configure the Redshift connector, create a catalog properties file in etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:
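A minimal sketch of that file is shown below; the host, port, database, and credentials are placeholders for your own cluster, and the JDBC URL is the one copied from the Redshift console in the previous step:

connector.name=redshift
connection-url=jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
connection-user=awsuser
connection-password=your_password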
Step 4: Check for available datasets, schemas and tables, etc and run SQL queries with Presto Client to access Redshift database
After successfully connecting to the Amazon Redshift database, you can connect to the Presto CLI and run the following queries. Make sure that the redshift catalog gets picked up, and run show schemas and show tables to understand the available data.
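For example, assuming the catalog properties file was named redshift.properties:

presto> show catalogs;
presto> show schemas from redshift;
presto> show tables from redshift.public;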
Following are some more complex queries you can run against sample data:
presto:public> -- Find top 10 buyers by quantity
->SELECT firstname, lastname, total_quantity
-> FROM (SELECT buyerid, sum(qtysold) total_quantity
-> FROM sales
-> GROUP BY buyerid
-> ORDER BY total_quantity desc limit 10) Q, users
-> WHERE Q.buyerid = userid
-> ORDER BY Q.total_quantity desc;
firstname | lastname | total_quantity
-----------+----------+----------------
Jerry | Nichols | 67
Armando | Lopez | 64
Kameko | Bowman | 64
Kellie | Savage | 63
Belle | Foreman | 60
Penelope | Merritt | 60
Kadeem | Blair | 60
Rhona | Sweet | 60
Deborah | Barber | 60
Herrod | Sparks | 60
(10 rows)
Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes
Splits: 214 total, 214 done (100.00%)
0:10 [222K rows, 0B] [22.4K rows/s, 0B/s]
presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales.
-> SELECT eventname, total_price
-> FROM (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile
-> FROM (SELECT eventid, sum(pricepaid) total_price
-> FROM sales
-> GROUP BY eventid)) Q, event E
-> WHERE Q.eventid = E.eventid
-> AND percentile = 1
-> ORDER BY total_price desc;
eventname | total_price
----------------------+-------------
Adriana Lecouvreur | 51846.00
Janet Jackson | 51049.00
Phantom of the Opera | 50301.00
The Little Mermaid | 49956.00
Citizen Cope | 49823.00
Sevendust | 48020.00
Electra | 47883.00
Mary Poppins | 46780.00
Live | 46661.00
(9 rows)
Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes
Splits: 230 total, 230 done (100.00%)
0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]
Step 5: Run SQL queries to join data between different systems like Redshift and Hive
Another great use case for Presto is data federation. In this example I will join an Apache Hive table with an Amazon Redshift table and run a JOIN query to access both tables from Presto.
Here, I have two catalogs: ahana_hive for the Hive database and my_redshift for Amazon Redshift. Each has a table within its schema: my_redshift.public.users and ahana_hive.default.customer, respectively.
The following simple SQL queries join these tables, the same way you would join two tables from the same database.
presto> show catalogs;
presto> select * from ahana_hive.default.customer;
presto> select * from my_redshift.public.users;
presto> Select * from ahana_hive.default.customer x join my_redshift.public.users y on
x.nationkey = y.userid;
Understanding Redshift’s Limitations
Running SQL queries on Redshift has its advantages, but there are some shortcomings associated with Amazon Redshift. If you are looking for more information about Amazon Redshift, check out the pros and cons and some of the limitations of Redshift in more detail.
Start Running SQL Queries on your Data Lakehouse
We made it simple to run SQL queries on Presto in the cloud. Get started with Ahana Cloud and start running SQL in a few mins.
Amazon Redshift is a cloud data warehouse, permitting the execution of SQL queries, offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.
At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2. Users can easily run SQL queries on Redshift, but there are some limitations.
Tutorial: How to run SQL queries with Presto on BigQuery
Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on Google BigQuery.
Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.
Pro tip: Before you move to the next step, try to use your downloaded credential JSON file with a third-party SQL tool like DBeaver to access your BigQuery table. This confirms that your credentials have valid access rights and helps isolate any issue with the credentials themselves.
Step 4: Configure Presto Catalog for Google BigQuery Connector
To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. You can create the file with the following contents, replacing the connection properties as appropriate for your setup. This should be done by editing the config map to make sure it is reflected in the deployment:
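A minimal sketch of that file is shown below; the project ID and the path to the service account credentials JSON file downloaded earlier are placeholders for your own setup:

connector.name=bigquery
bigquery.project-id=my-gcp-project
bigquery.credentials-file=/opt/presto/etc/bigquery-credentials.json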
Step 7: Check for available datasets, schemas and tables, etc
After successfully connecting the database with Superset, run the following queries and make sure that the bigquery catalog gets picked up; then run show schemas and show tables to understand the available data.
show catalogs;
show schemas from bigquery;
show tables from bigquery.rohan88;
Step 8: Run SQL query from Apache Superset to access BigQuery table
Once you access your database schema, you can run SQL queries against the tables as shown below.
select * from catalog.schema.table;
select * from bigquery.rohan88.flights LIMIT 1;
You can perform similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI.
The following example shows how you can join a Google BigQuery table with a Hive table from S3 and run SQL queries across both.
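A sketch of such a federated query is shown below; it reuses the hypothetical ahana_hive and bigquery catalogs from this tutorial, and the join columns are placeholders for your own schema:

presto> SELECT count(*)
     -> FROM ahana_hive.default.customer c
     -> JOIN bigquery.rohan88.flights f
     -> ON c.custkey = f.passenger_id;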
At Ahana, we have made it very simple and user friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running sql queries in a few mins.
This article will explain what Presto is and what it’s used for.
What is an Open Data Lake in the Cloud?
Problems that necessitate a data lake
In today’s competitive landscape, more and more companies are leveraging their data to make better decisions, provide value to their customers, and improve their operations. This is obvious when you understand the environment in which these companies operate. Data-driven insights can help business and product leaders hone in on customer needs and find untapped opportunities through the development of evidence-based strategies. Analytics dashboards can also be presented to customers for added value.
Traditionally, insights are gleaned from rather small amounts of enterprise data which is what you would expect – historical information about products, services, customers, and sales. But now, the modern business must deal with 1000s of times more data, which encompasses more types of data formats and is far beyond Enterprise Data. Some current examples include 3rd party data feeds, IoT sensor data, event data, geospatial and other telemetry data.
The problem with having 1000s of times the data is that databases, and specifically data warehouses, can be very expensive when used to handle this amount. And to add to this, data warehouses are optimized to handle relational data with a well-defined structure and schema. As both data volumes and usage grow, the costs of a data warehouse can easily spiral out of control. Those costs, coupled with the inherent lock-in associated with data warehouses, have left many companies looking for a better solution, either augmenting their enterprise data warehouse or moving away from them altogether.
The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture.
The Open Data Lake in the cloud centers on S3-based object storage. In AWS, there can be many S3 buckets across an organization. In Google Cloud, the equivalent service is Google Cloud Storage (GCS), and in Microsoft Azure it is Azure Blob Storage. The data lake can store the relational data that typically comes from business apps, just like a data warehouse does. But the data lake also stores non-relational data from a variety of sources, as mentioned above. The data lake can store structured, semi-structured, and/or unstructured data.
With all this data stored in the data lake, companies can run different types of analytics directly, such as SQL queries, real-time analytics, and AI/Machine Learning. A metadata catalog of the data enables the analytics of the non-relational data.
Why Open for a Data Lake
As mentioned, companies have the flexibility to run different types of analytics, using different analytics engines and frameworks. Storing the data in open formats is the best-practice for companies looking to avoid the lock-in of the traditional cloud data warehouse. The most common formats of a modern data infrastructure are open, such as Apache Parquet and ORC. They are designed for fast analytics and are independent of any platform. Once data is in an open format like Parquet, it would follow to run open source engines like Presto on it. Ahana Cloud is a Presto managed service which makes it easy, secure, and cost efficient to run SQL on the Open Data Lake.
Best Practices for Resource Management in PrestoDB
Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources. Resource management in a distributed environment makes accessibility of data easier and manages resources over the network of autonomous computers (i.e. Distributed System). The basis of resource management in the distributed system is also resource sharing.
PrestoDB is a distributed query engine written by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Written for the Hadoop ecosystem, PrestoDB is built to scale to tens of thousands of nodes and process petabytes of data. In order to be usable at a production scale, PrestoDB was built to serve thousands of queries to multiple users without facing bottle-necking and “noisy neighbor” issues. PrestoDB makes use of resource groups in order to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups as well as best practices and considerations to think about before setting up a production system with resource grouping.
Getting Started
Presto has multiple “resources” that it can manage resource quotas for. The two main resources are CPU and memory. Additionally, there are granular resource constraints that can be specified, such as concurrency, time, and cpuTime. All of this is done via a pretty ugly JSON configuration file; an example resource specification is shown in the Resource Group Specifications section below.
Okay, so there is clearly a LOT going on here so let’s start with the basics and roll our way up. The first place to start is understanding the mechanisms Presto uses to enforce query resource limitation.
Penalties
Presto doesn’t enforce any resources at execution time. Rather, Presto introduces a concept of a ‘penalty’ for users who exceed their resource specification. For example, if user ‘bob’ were to kick off a huge query that ended up taking vastly more CPU time than allotted, then ‘bob’ would incur a penalty which translates to an amount of time that ‘bob’s’ queries would be forced to wait in a queued state until they could be runnable again. To see this scenario at hand, let’s split the cluster resources by half and see what happens when two users attempt to submit 5 different queries each at the same time.
Resource Group Specifications
The example below is a resource specification of how to evenly distribute CPU resources between two different users.
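A sketch of such a specification is shown below; the group names follow the text, while the concrete numbers, the memory limits, and the second user name (alice) are assumptions chosen to match the experiment described next, with one running query per group and up to five queued:

{
  "rootGroups": [
    {
      "name": "query1",
      "softMemoryLimit": "50%",
      "hardConcurrencyLimit": 1,
      "maxQueued": 5,
      "schedulingPolicy": "fair"
    },
    {
      "name": "query2",
      "softMemoryLimit": "50%",
      "hardConcurrencyLimit": 1,
      "maxQueued": 5,
      "schedulingPolicy": "fair"
    }
  ],
  "selectors": [
    { "user": "alice", "group": "query1" },
    { "user": "bob", "group": "query2" }
  ]
}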
The above resource config defines two main resource groups called ‘query1’ and ‘query2’. These groups will serve as buckets for the different queries/users. A few parameters are at work here:
hardConcurrencyLimit sets the number of concurrent queries that can be run within the group
maxQueued sets the limit on how many queries can be queued
schedulingPolicy ‘fair’ determines how queries within the same group are prioritized
Kicking off a single query as each user has no effect, but subsequent queries will stay QUEUED until the first completes. This at least confirms the hardConcurrencyLimit setting. Testing queuing 6 queries also shows that the maxQueued is working as intended as well.
Introducing the soft CPU limit will penalize any query that is caught using too much CPU time in a given CPU quota period. Currently this is set to 1 minute, and each group is given half of that CPU time. However, testing the above configuration yielded some odd results: once the first query finished, other queries were queued for an inordinately long amount of time. Looking at the Presto source code shows the reasoning. The softCpuLimit and hardCpuLimit are based on a combination of total cores and the cpuQuotaPeriod. For example, on a 10 node cluster with r5.2xlarge instances, each Presto worker node has 8 vCPUs. This leads to a total of 80 vCPUs for the cluster, which results in 80m of vCPU-minutes in the given cpuQuotaPeriod. Therefore, the correct values are shown below.
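As a sketch, the corrected settings for this 80 vCPU cluster would look roughly like the following; only the query1 group is shown, query2 mirrors it with the same softCpuLimit, and the exact numbers should be recomputed for your own cluster size and quota period:

{
  "rootGroups": [
    {
      "name": "query1",
      "softMemoryLimit": "50%",
      "hardConcurrencyLimit": 1,
      "maxQueued": 5,
      "schedulingPolicy": "fair",
      "softCpuLimit": "40m"
    }
  ],
  "selectors": [
    { "user": "alice", "group": "query1" }
  ],
  "cpuQuotaPeriod": "1m"
}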
With testing, the above resource group spec results in two queries completing – using a total of 127m CPU time. From there, all further queries block for about 2 minutes before they run again. This blocked time adds up because for every minute of cpuQuotaPeriod, each user is granted 40 minutes back on their penalty. Since the first minute queries exceeded by 80+ minutes, it would take 2 cpuQuotaPeriods to bring the penalty back down to zero so queries could submit again.
Conclusion
Resource group implementation in Presto definitely has some room for improvement. The most obvious is that for ad hoc users who may not understand the cost of their query before execution, the resource group will heavily penalize them until they submit very low cost queries. However, this solution will minimize the damage that a single user can perform on a cluster over an extended duration and will average out in the long run. Overall, resource groups are better suited for scheduled workloads which depend on variable input data so that a specified scheduled job doesn’t arbitrarily end up taking over a large chunk of resources. For resource partitioning between multiple users/teams the best approach still seems to be to run and maintain multiple segregated Presto clusters.
The term Data Lakehouse has become popular over the last year as more customers migrate their workloads to the cloud. This article explains what a Data Lakehouse is, the common architecture, and how companies are using it in production today.
Take a deep dive into Presto: what it is, how it started, and the benefits.
Configuring RaptorX – a multi-level caching with Presto
RaptorX Background and Context
Meta introduced a multi-level cache at PrestoCon 2021. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. Here at Ahana, engineers have also been working on RaptorX to help make it usable for the community by fixing a few open issues and by tuning and testing heavily with other workloads. This is a unique and very powerful feature only available in PrestoDB and not in other versions or forks of the Presto project.
Presto is the disaggregated compute-storage query engine, which helps customers and cloud providers scale independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency as scanning huge amounts of data between storage tier and compute tier is going to be IO-bound over the network. As with any database, optimized I/O is a critical concern to Presto. When possible, the priority is to not perform any I/O at all. This means that memory utilization and caching structures are of utmost importance.
Let’s understand the normal workflow of how the Presto Hive connector works:
During a read operation, the planner sends a request to the metastore for metadata (partition info)
Scheduler sends requests to remote storage to get a list of files and does the scheduling
On the worker node, first, it receives the list of files from the scheduler and sends a request to remote storage to open a file and read the file footers
Based on the footer, Presto understands which data blocks or chunks need to be read from remote storage
Once workers read them, Presto performs computation on the leaf worker nodes based on join or aggregation and does the shuffle back to send query results to the client.
That is a lot of RPC calls: not just to the Hive Metastore for the partition information, but also to remote storage to list files, schedule those files, open files, and then retrieve and read those data files. Each of these IO paths for the Hive connector is a bottleneck on query performance, and this is the reason RaptorX builds a multi-layer cache intelligently, so that you can maximize the cache hit rate and boost query performance.
RaptorX introduces a total of five types of caches and a scheduler. This cache system is only applicable to Hive connectors.
Multi-layer Cache | Type | Affinity Scheduling | Benefits
Data IO | Local disk | Required | Reduced query latency
Intermediate Result Set | Local disk | Required | Reduced query latency and CPU utilization for aggregation queries
File Metadata | In-memory | Required | Reduced CPU utilization and query latency
Metastore | In-memory | NA | Reduced query latency
File List | In-memory | NA | Reduced query latency
Table: Summary of the Presto multi-layer cache implementation
Further, this article explains how you can configure and test various layers of RaptorX cache in your Presto cluster.
#1 Data(IO) cache
This cache makes use of a library built on the Alluxio LocalCacheFileSystem, which is an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on the worker node that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB. It uses an LRU policy for evictions, and local disks are required to enable this cache.
To enable this cache, worker configuration needs to be updated with below properties at
etc/catalog/<catalog-name>.properties
cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=150GB
The max cache size can be set based on your requirements and available local disk.
cache.base-directory=file:///mnt/disk1/cache
Also add the following Alluxio property to the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:
-Dalluxio.user.app.id=presto
#2 Fragment result set cache
This is an intermediate result set cache that lets you cache partially computed result sets on the worker’s local SSD drive. This prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.
Add the following properties under the /config.properties
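A sketch of the relevant properties is shown below; the directory and sizing values are examples and should match your local SSD layout, and the exact property names should be checked against your Presto version’s documentation:

fragment-result-cache.enabled=true
fragment-result-cache.max-cached-entries=1000000
fragment-result-cache.base-directory=file:///mnt/disk1/fragmentcache
fragment-result-cache.cache-ttl=24h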
#3 Metastore cache
A Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. This cache is versioned to confirm the validity of cached metadata.
In order to enable metadata cache set below properties at /<catalog-name>.properties
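For example (the TTL and size values are illustrative; check your Presto version’s documentation for the exact property names):

hive.partition-versioning-enabled=true
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000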
#4 File list cache
A Presto coordinator caches file lists from the remote storage partition directory to avoid long listFile calls to remote storage. This is a coordinator-only in-memory cache.
Enable file list cache by setting below set of properties at
/catalog/<catalog-name>.properties
# List file cache
hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*
#5 File metadata cache
Caches open file descriptors and stripe/file footer information in worker memory. These pieces of data are most frequently accessed when reading files. This cache is not just useful for decreasing query latency but also for reducing CPU utilization.
This is an in-memory cache, suitable for the ORC and Parquet file formats.
For ORC, it includes the file tail (postscript, file footer, file metadata), stripe footer, and stripe stream (row indexes/bloom filters).
For Parquet, it caches the file and block level metadata.
In order to enable metadata cache set below properties at /<catalog-name>.properties
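A sketch of these catalog properties is shown below; sizes and TTLs are illustrative, and the exact property names should be checked against your Presto version’s Hive connector documentation:

# ORC metadata cache
hive.orc.file-tail-cache-enabled=true
hive.orc.file-tail-cache-size=100MB
hive.orc.file-tail-cache-ttl-since-last-access=6h
hive.orc.stripe-metadata-cache-enabled=true
hive.orc.stripe-footer-cache-size=100MB
hive.orc.stripe-footer-cache-ttl-since-last-access=6h
hive.orc.stripe-stream-cache-size=300MB
hive.orc.stripe-stream-cache-ttl-since-last-access=6h
# Parquet metadata cache
hive.parquet.metadata-cache-enabled=true
hive.parquet.metadata-cache-size=100MB
hive.parquet.metadata-cache-ttl-since-last-access=6h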
The <catalog-name> in the above configuration should be replaced by the catalog name that you are setting these in. For example, If the catalog properties file name is ahana_hive.properties then it should be replaced with “ahana_hive”.
#6 Affinity scheduler
With affinity scheduling, Presto Coordinator schedules requests that process certain data/files to the same Presto worker node to maximize the cache hits. Sending requests for the same data consistently to the same worker node means less remote calls to retrieve data.
Data caching is not supported with random node scheduling. Hence, this property must be enabled for the RaptorX Data IO, fragment result, and file metadata caches to work.
In order to enable affinity scheduler set below property at /catalog.properties
hive.node-selection-strategy=SOFT_AFFINITY
How can you test or debug your RaptorX cache setup with JMX metrics?
Each section below lists the queries to run and the JMX metrics to query in order to verify cache usage.
Note: If your catalog is not named ‘ahana_hive’, you will need to change the table names to verify the cache usage. Substitute ahana_hive with your catalog name.
Data IO Cache
Queries to trigger Data IO cache usage
USE ahana_hive.default;
SELECT count(*) from customer_orc group by nationkey;
SELECT count(*) from customer_orc group by nationkey;
Queries to verify Data IO data cache usage
-- Cache hit rate.
SELECT * from
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";
-- Bytes read from the cache
SELECT * FROM
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";
-- Bytes requested from cache
SELECT * FROM
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";
-- Bytes written to cache on each node.
SELECT * from
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";
-- The number of cache pages(of size 1MB) currently on disk
SELECT * from
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";
-- The amount of cache space available.
SELECT * from
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";
-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM
jmx.current like '%alluxio%';
Fragment Result Cache
Queries like the ones below produce plan fragments that are eligible for having their results cached.
SELECT count(*) from customer_orc;
SELECT count(*) from customer_orc;
Query the fragment result set cache JMX metrics.
-- All Fragment result set cache metrics like cachehit, cache entries, size, etc
SELECT * FROM
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";
ORC metadata cache
Queries to trigger ORC cache usage
SELECT count(*) from customer_orc;
SELECT count(*) from customer_orc;
Query ORC Metadata cache JMX metrics
-- File tail cache metrics
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";
-- Stripe footer cache metrics
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean";
-- Stripe stream(Row index) cache metrics
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";
Parquet metadata cache
Queries to trigger Parquet metadata cache
SELECT count(*) from customer_parquet;
SELECT count(*) from customer_parquet;
Query Parquet Metadata cache JMX metrics.
-- Verify cache usage
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";
File List cache
Query File List cache JMX metrics.
-- Verify cache usage
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";
In addition to this, we have enabled these multilayer caches on Presto for Ahana Cloud by adding S3 support as the external filesystem for Data IO cache, more optimized scheduling and tooling to visualize the cache usage.
Figure: Multi-level data lake caching with RaptorX
Ahana-managed Presto clusters can take advantage of the RaptorX cache, and at Ahana we have simplified all of these steps so that data platform users can enable data lake caching seamlessly with just one click. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our on-demand webinar where we share how you can build an Open Data Lake Analytics stack.
To learn how to benchmark Presto, let’s first start by covering the basics. Presto is an open source MPP query engine designed from the ground up for high performance with linear scaling. Businesses looking to solve their analytics workloads using Presto need to understand how to evaluate Presto performance, and this technical guide will help in that endeavor. Learn how to get started with running your own benchmark.
To help users who would like to benchmark Presto, we have written a detailed, informative guide on how to set up your PrestoDB benchmark using benchto. Benchto is an open source framework that provides an easy and manageable way to define, run and analyze macro benchmarks in clustered environments.
Running a benchmark on PrestoDB can help you to identify things like:
system resource requirements
resource usage during various operations
performance metrics for such operations
…and more, depending on your workload and use case
This technical guide provides an overview on TPC-H, the industry standard for benchmarking. It also goes into detail to explain how to configure and use the open-source Benchto tool to benchmark Presto. In addition, it will show an example on comparing results between two different runs of an Ahana-managed Presto cluster with and without cache enabled.
AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Customers can manage permissions to data in a single place, making it easier to enforce security across a wide range of tools and services. Over the past several months we’ve worked closely with the AWS Lake Formation team to bring Lake Formation capabilities to Presto on AWS. Further, we’re grateful to our customers who were willing to preview early versions of our integration.
Today, Ahana Cloud for Presto allows customers to use Presto to query their data protected with AWS Lake Formation fine-grained permissions with a few clicks. Our customers can bring Presto to their existing AWS stack and scale their data teams without compromising security. We’re thrilled that the easiest managed service for Presto on AWS just got easier and more secure.
Here’s a quick video tutorial that shows you how easy it is to get started with AWS Lake Formation and Ahana:
Additionally, we’ve put together a list of resources where you can learn more about the integration.
PrestoCon 2021: Authorizing Presto with AWS Lake Formation presented by Jalpreet Singh Nanda (Software Engineer, Ahana) and Roy Hasson (Principal Product Manager, Amazon)
If you’re ready to get started with AWS Lake Formation and Ahana Cloud, head over to our account sign up page where you can start with a free 14-day trial of Ahana Cloud. You can also drop us a note at product@ahana.io and we can help get you started. Happy building!
What is Presto on Spark?
Overview
Presto was originally designed to run interactive queries against data warehouses. However, now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. Popular workloads on data lakes include:
1. Reporting and dashboarding
This includes serving custom reporting for both internal and external developers for business insights, and many organizations also use Presto for interactive A/B testing analytics. A defining characteristic of this use case is the requirement for low latency: tens to hundreds of milliseconds at very high QPS. Not surprisingly, this use case is almost exclusively served by Presto; that’s what Presto is designed for.
2. Data science with SQL notebooks
This use case is ad hoc analysis and typically needs moderate latency, ranging from seconds to minutes. These are the queries of data scientists and business analysts who want to perform compact ad hoc analysis to understand product usage, for example user trends and how to improve the product. The QPS is relatively low because users have to manually initiate these queries.
3. Batch processing for large data pipelines
These are scheduled jobs that run every day, every hour, or whenever the data is ready. They often contain queries over very large volumes of data; latency can be up to tens of hours, and processing can range from CPU days to CPU years over terabytes to petabytes of data.
Presto works exceptionally well for ad hoc or interactive queries today, and even some batch queries, with the constraint that the entire query must fit in memory and run quickly enough that fault tolerance is not required. Most ETL batch workloads that don’t fit in this box run on “very big data” compute engines like Apache Spark. Having multiple compute engines with different SQL dialects and APIs makes managing and scaling these workloads complicated for data platform teams. Hence, Facebook decided to simplify things and build Presto on Spark as the path to further scale Presto. Before we get into Presto on Spark, let me explain a bit more about the architecture of each of these two popular engines.
Presto’s Architecture
Presto is designed for low latency and follows the classic MPP architecture; it uses in-memory streaming shuffle to achieve low latency. Presto has a single shared coordinator per cluster with an associated pool of workers. Presto tries to schedule as many queries as possible on the same Presto worker (shared executor), in order to support multi-tenancy.
This architecture provides very low latency scheduling of tasks and allows concurrent processing of multiple stages of a query, but the tradeoff is that the coordinator is a SPOF and bottleneck, and queries are poorly isolated across the entire cluster.
Additionally streaming shuffle does not allow for much fault tolerance further impacting the reliability of long running queries.
Spark’s Architecture
On other hand, Apache Spark is designed for scalability from the very beginning and it implements a Map-Reduce architecture. Shuffle is materialized to disk fully between stages of execution with the capability to preempt or restart any task. Spark maintains an isolated Driver to coordinate each query and runs tasks in isolated containers scheduled on demand. These differences improve reliability and reduce overall operational overhead.
Why isn’t Presto alone a good fit for batch workloads?
Scaling an MPP-architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1]. To simplify this, let’s examine the aggregation query below. Essentially, this query goes over the orders table in TPC-H, grouping on custkey and summing totalprice. Presto leverages in-memory shuffle: after reading the data and performing a partial aggregation for each key on each worker, it shuffles on custkey.
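For reference, a minimal sketch of the kind of aggregation described here, assuming the standard TPC-H orders schema exposed by Presto’s tpch connector (the exact query in the original post may differ):

-- hypothetical sketch; column names follow Presto's tpch connector
SELECT custkey, sum(totalprice) AS total_price
FROM orders
GROUP BY custkey;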
In-memory shuffle means the producer buffers data in memory and waits for it to be fetched by the consumer. As a result, all the tasks before and after the exchange have to execute at the same time; thinking in MapReduce terms, all the mappers and all the reducers have to run concurrently. This makes in-memory shuffle an all-or-nothing execution model.
This leads to inflexible scheduling, and scaling to larger queries becomes more difficult because everything runs concurrently. In the aggregation phase, the query may exceed the memory limit because everything has to be held in memory in hash tables in order to track each group (custkey).
Additionally, we are limited by the size of the cluster in how many nodes we can hash-partition the data across to avoid having to fit it all in memory. Using distributed disk (Presto on Spark, Presto Unlimited), we can partition the data further and are only limited by the number of open files, and even that limit can be scaled quite a bit by a shuffle service.
This makes it difficult to scale Presto to very large and complex batch pipelines, which can run for hours joining and aggregating over huge amounts of data. That motivated the development of Presto Unlimited, which adapts Presto’s MPP design to large ETL workloads and improves the user experience at scale.
While Presto Unlimited solved part of the problem by allowing shuffle to be partitioned over distributed disk, it didn’t fully solve fault tolerance. Additionally, it did nothing to improve isolation and resource management.
Presto on Spark
Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler/evaluation as a library with Spark’s RDD API used to manage execution of Presto’s embedded evaluation. This is similar to how Google chose to embed F1 Query inside their MapReduce framework.
The high-level goal is to bring a fully disaggregated shuffle to Presto’s MPP runtime, and we achieved this by adding a materialization step right after the shuffle. The materialized shuffle is modeled as a temporary partitioned table, which brings more flexible execution after the shuffle and allows partition-level retries. With Presto on Spark, we can do a fully disaggregated shuffle on custkey for the above query on both the mapper and reducer side; this means all mappers and reducers can be independently scheduled and are independently retriable.
Presto On Spark at Intuit
Superglue is a homegrown tool at Intuit that helps users build, manage and monitor data pipelines. Superglue was built to democratize data for analysts and data scientists. Superglue minimizes time spent developing and debugging data pipelines, and maximizes time spent on building business insights and AI/ML.
Many analysts at Intuit use Presto (AWS Athena) to explore data in the data lake on S3. These analysts would spend several hours converting exploration SQL written for Presto into Spark SQL to operationalize and schedule it as data pipelines in Superglue. To minimize SQL dialect conversion issues and the associated productivity loss for analysts, the Intuit team explored various options including query translation, query virtualization, and Presto on Spark. After a quick POC, Intuit decided to go with Presto on Spark, because it leverages Presto’s compiler/evaluation as a library (no query conversion is required) together with Spark’s scalable data processing capabilities.
Presto on Spark is now in production at Intuit. Within three months, hundreds of critical pipelines comprising thousands of jobs were running on Presto on Spark via Superglue.
Presto on Spark runs as a library that is submitted with spark-submit or as a Jar task on the Spark cluster. Scheduled batch data pipelines are launched on ephemeral clusters to take advantage of resource isolation, manage cost, and minimize operational overhead. DDL statements are executed against Hive and DML statements are executed against Presto. This enables analysts to write Hive-compatible DDL, and the user experience remains unchanged.
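As a rough sketch, the submission typically looks something like the following; the master, resource settings, version placeholders, and file paths here are illustrative rather than Intuit’s actual configuration:

# illustrative only: version, master, resources, and paths are placeholders
spark-submit \
  --master yarn \
  --executor-cores 4 \
  --conf spark.task.cpus=4 \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-<version>.jar \
  --package presto-spark-package-<version>.tar.gz \
  --config ./config.properties \
  --catalogs ./catalogs \
  --catalog hive \
  --schema default \
  --file query.sql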
This solution helped enable a performant and scalable platform with seamless end-to-end experience for analysts to explore and process data. It thereby improved analysts’ productivity and empowered them to deliver insights at high speed.
When To Use Presto on Spark
Spark is the tool of choice across the industry for running large scale complex batch ETL pipelines. Presto on Spark heavily benefits pipelines written in Presto that operate on terabytes/petabytes of data. It takes advantage of Spark’s large scale processing capabilities. The biggest win here is that no query conversion is required and you can leverage Spark for
Scaling to larger data volumes
Scaling Presto’s resource management to larger clusters
Increasing the reliability and elasticity of Presto as a compute engine
Why Presto on Spark matters
We tried to achieve the following to adapt ‘Presto on Spark’ to Internet-scale batch workloads [2]:
Fully disaggregated shuffles
Isolated executors
Presto resource management, a different scheduler, speculative execution, etc.
A unified option for batch data processing and ad hoc querying is important for creating the experience of queries that scale: more queries can run at scale without failing and without requiring rewrites between different SQL dialects. We believe this is only a first step towards more confluence between the Spark and Presto communities. It is also a major step towards enabling a unified SQL experience across interactive and batch use cases. Today, internet giants like Facebook have moved over to Presto on Spark, and we have seen many organizations, including Intuit, start running their complex data pipelines in production with Presto on Spark.
“Presto on Spark” is one of the most active development areas in Presto; feel free to check it out and please give it a star! If you have any questions, ask in the PrestoDB Slack channel.
Tutorial: How to define SQL functions with Presto across all connectors
Presto is a widely used SQL query engine for data lakes that comes equipped with many built-in functions to serve most use cases. However, there may be certain situations where you need to define your own function. Luckily, Presto allows you to define dynamic expressions as SQL functions, which are stored separately from the Presto source code and managed by a functions namespace manager. You can even set up this manager with a MySQL database. This feature is so popular at Facebook that they have over a thousand functions defined in their instance of Presto.
Function Namespace Manager
By organizing data in catalogs and schemas, Presto allows users to easily access and manipulate data from multiple sources as if they were a single database.
A catalog is a logical namespace that represents a collection of data sources that can be queried in Presto. Each catalog contains one or more schemas, which are essentially named containers that hold tables, views, and other objects.
A function namespace is a special catalog.schema that stores functions, written in a form like mysql.test. Each catalog.schema can be made a function namespace. A function namespace manager is a plugin that handles a collection of these function catalog schemas. Catalogs can be assigned to connectors in Presto, which allows the Presto engine to carry out tasks like creating, modifying, and deleting functions.
This user-defined function management is separated from the connector API for flexibility, so these SQL functions can be used across all connectors. Further, a query is guaranteed to use the same version of a function throughout its execution, and any modification to a function is versioned.
Implementation
Today, the function namespace manager is implemented on top of MySQL, so users need a running MySQL service to initialize the MySQL-based function namespace manager.
Step 1: Provision a MySQL server and generate a JDBC URL for access.
Suppose the MySQL server can be reached at localhost:3306; an example database URL and the corresponding configuration follow.
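A minimal sketch of that configuration, assuming a function namespace manager file at etc/function-namespace/ahana.properties and placeholder credentials (adjust the URL, user, and password for your MySQL server):

# hypothetical example; user, password, and table names are placeholders
function-namespace-manager.name=mysql
database-url=jdbc:mysql://localhost:3306?user=root&password=secret
function-namespaces-table-name=function_namespaces
functions-table-name=sql_functions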
Here is another simple example: creating an EpochTimeToLocalDate function that converts Unix time to the local timezone under the ahana.default function namespace.
presto> CREATE FUNCTION ahana.default.EpochTimeToLocalDate (x bigint)
-> RETURNS timestamp
-> LANGUAGE SQL
-> DETERMINISTIC RETURNS NULL ON NULL INPUT
-> RETURN from_unixtime (x);
CREATE FUNCTION
presto> SELECT ahana.default.EpochTimeToLocalDate(1629837828) as date;
date
-------------------------
2021-08-24 13:43:48.000
(1 row)
Note
The function-namespaces-table-name property (the name of the table that stores all the function namespaces managed by this manager) can be used if you need to instantiate multiple function namespace managers; otherwise, functions created in a single function namespace manager can be used across all the different databases and connectors. [2]
At Ahana we have simplified all of these steps (the MySQL container, schemas, databases, tables, and the additional configuration required to manage functions). Data platform users just need to create their own SQL functions and use them in SQL queries; there is no need to provision or manage additional MySQL servers.
Future Roadmap
Remote function support with a remote UDF Thrift API
This allows you to run arbitrary functions that are either not safe or not possible to run within the worker JVM: unreliable Java functions, C++, Python, etc.
Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with AWS Glue as a catalog on a laptop.
We mentioned in the Presto 104 tutorial why we use a catalog. Just to recap, Presto is a disaggregated database engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, and execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog where the database schema resides is called a catalog. Two popular catalogs have emerged – the Hive Metastore and the AWS Glue catalog.
What is AWS Glue?
AWS Glue is an event-driven, serverless computing platform provided by AWS. It provides a data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog does the mapping between database tables and columns and the objects or files that reside in the data lake, such as files or immutable objects in AWS S3.
In this tutorial, we will focus on using Presto with the AWS Glue on your laptop.
This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.
Implementation steps
Step 1:
Create a docker network namespace so that containers can communicate with each other using that network namespace.
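For example (the network name presto_network is just an illustrative choice):

docker network create presto_network   # network name is a placeholder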
Ahana has developed a sandbox for prestodb (ahanaio/prestodb-sandbox) that can be downloaded from Docker Hub. Use the commands below to download and run the prestodb sandbox, which comes with all the packages needed to run prestodb.
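A minimal sketch of pulling the image and starting the coordinator container, assuming the network created above and a container named coordinator:

docker pull ahanaio/prestodb-sandbox
docker run -d -p 8080:8080 --name coordinator --network presto_network ahanaio/prestodb-sandbox   # network name is a placeholder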
We only want the coordinator to run on this container, without the worker node, so edit the config.properties file and set node-scheduler.include-coordinator to false.
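After the edit, the coordinator’s etc/config.properties would look roughly like this; the port and discovery URI shown are typical sandbox defaults and may differ in your setup:

# sketch of a coordinator-only configuration; values are typical defaults
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080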
Restart the docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator and stopped the worker service.
# docker restart coordinator
Step 6:
Create three more containers using ahanaio/prestodb-sandbox
Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively for each worker, and point discovery.uri at the coordinator. An example worker configuration is shown below.
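For instance, the first worker’s etc/config.properties would look roughly like this, assuming the coordinator container is reachable by the hostname coordinator (the other workers use ports 8082 and 8083):

# sketch for worker1; hostname and ports depend on your container setup
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080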
Run aws configure and enter the access key and secret key you configured.
aws configure
AWS Access Key ID [None]: XXXXXXXXXXXXX
AWS Secret Access Key [None]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Default region name [None]:
Default output format [None]:
Step 12:
Create the /opt/presto-server/etc/catalog/glue.properties file and add the AWS Glue properties for Presto; this file needs to be added on both the coordinator and worker containers. Add the AWS access and secret keys generated in the previous step to hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key.
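A minimal sketch of glue.properties; the region and keys here are placeholders you should replace with your own:

# illustrative Glue catalog configuration; region and keys are placeholders
connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=us-east-1
hive.metastore.glue.aws-access-key=<your-access-key>
hive.metastore.glue.aws-secret-key=<your-secret-key>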
In this tutorial, we provide steps to use Presto with AWS Glue as a catalog on a laptop. If you’re looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.
Presto 104: Running Presto with Hive Metastore on your Laptop
Introduction
This is the 4th tutorial in our Getting Started with Presto series. To recap, here are the first 3 tutorials:
Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with Hive Metastore on a laptop.
Presto is a disaggregated engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, and execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog where the database schema resides lives in what is called a catalog. Two popular catalogs have emerged. From the Hadoop world, the Hive Metastore continues to be widely used. Note that this is different from the Hive query engine; this is the system catalog, where information about table schemas and their locations lives. In AWS, the Glue catalog is also very popular.
In this tutorial, we will focus on using Presto with the Hive Metastore on your laptop.
What is the Hive Metastore?
The Hive Metastore is the mapping between database tables and columns and the objects or files that reside in the data lake. This could be a file system when using HDFS, or immutable objects in object stores like AWS S3. This document simplifies the process for a laptop scenario to get you started. For real production workloads, Ahana Cloud, which provides Presto as a managed service with a Hive Metastore, is a good choice if you are looking for an easy and performant solution for SQL on AWS S3.
Implementation steps
Step 1
Create a docker network namespace so that containers can communicate with each other using that network namespace.
Ahana has developed a sandbox for prestodb (ahanaio/prestodb-sandbox) that can be downloaded from Docker Hub. Use docker pull to download the prestodb sandbox, which comes with all the packages needed to run prestodb.
We only want the coordinator to run on this container, without the worker node, so edit the config.properties file and set node-scheduler.include-coordinator to false.
Restart the docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator and stopped the worker service.
# docker restart coordinator
Step 6:
Create three more containers using ahanaio/prestodb-sandbox
Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively for each worker, and point discovery.uri at the coordinator.
Create the /opt/presto-server/etc/catalog/hive.properties file and add the Hive Metastore endpoint for Presto; this file needs to be added on both the coordinator and worker containers.
If you choose to validate using an AWS S3 bucket, provide the security credentials for it, as sketched below.
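A minimal sketch of hive.properties, assuming the Hive Metastore is reachable at a host named hive-metastore on the default Thrift port; the S3 keys are only needed if you validate against an S3 bucket:

# illustrative Hive catalog configuration; hostname and keys are placeholders
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.aws-access-key=<your-access-key>
hive.s3.aws-secret-key=<your-secret-key>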
presto:default> create schema tpch with (location='file:///root');
CREATE SCHEMA
presto:default> use tpch;
If you have access to an S3 bucket, use the following create command with S3 as the destination.
presto:default> create schema tpch with (location='s3a://bucket_name');
CREATE SCHEMA
presto:default> use tpch;
Step 16:
Hive has the option to create two types of tables:
Managed tables
External tables
Managed tables are tightly coupled with the data at the destination, which means that if you delete a table, the associated data is also deleted.
External tables are loosely coupled with the data; they maintain a pointer to it, so deleting the table does not delete the data at the external location.
Transactional (ACID) semantics are only supported on managed tables.
We will create a managed table under the hive.tpch schema.
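As an illustration, a managed table could be created in the hive.tpch schema from the sandbox’s built-in tpch catalog, assuming that catalog is available:

-- hypothetical example; assumes the tpch connector's tiny schema is present
presto> CREATE TABLE hive.tpch.customer AS SELECT * FROM tpch.tiny.customer;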
In this tutorial, we provide steps to use Presto with Hive Metastore as a catalog on a laptop. Additionally AWS Glue can also be used as a catalog for prestodb. If you’re looking to get started easily with Presto and a pre-configured Hive Metastore, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.
Connect Superset to Presto
This blog post will provide you with an understanding of how to connect Superset to Presto.
TL;DR
Superset refers to a connection to a distinct data source as a database. A single Presto cluster can connect to multiple data sources by configuring a Presto catalog for each desired data source. Hence, to make a Superset database connection to a particular data source through Presto, you must specify the Presto cluster and catalog in the SQLAlchemy URI as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>.
Superset and SQLAlchemy
Superset is built as a Python Flask web application and leverages SQLAlchemy, a Python SQL toolkit, to provide a consistent abstraction layer to relational data sources. Superset uses a consistent SQLAlchemy URI as a connection string for a defined Superset database. The schema for the URI is as follows: dialect+driver://username:password@host:port/database. We will deconstruct the dialect, driver, and database in the following sections.
SQLAlchemy defines a dialect as the system it uses to communicate with the specifics of various databases (e.g. the flavor of SQL) and their DB-APIs, the low-level Python APIs used to talk to specific relational data sources. A Python DB-API database driver is required for a given data source. For example, PyHive is a DB-API driver used to connect to Presto. It is possible for a single dialect to choose between multiple DB-API drivers. For example, the PostgreSQL dialect can support the following DB-API drivers: psycopg2, pg8000, psycopg2cffi, and pygresql. Typically, a single DB-API driver is set as the default for a dialect and is used when no explicit DB-API is specified. For PostgreSQL, the default DB-API driver is psycopg2.
The term database can be confusing since it is heavily overloaded. In a typical scenario, a given data source, such as PostgreSQL, has multiple logical groupings of tables, which are called “databases”. In a way, these “databases” provide namespaces for tables; identically named tables can exist in two different “databases” without collision. As an example, we can use the PostgreSQL instance available when locally installing Superset with Docker Compose.
In this instance of PostgreSQL, we have four databases: postgres, superset, template0, and template1.
We can look into the superset database and see the tables in that database.
The key thing to remember here is that ultimately a Superset database needs to resolve to a collection of tables, whatever that is referred to in a particular dialect.
superset@localhost:superset> \c superset
You are now connected to database "superset" as user "superset"
+--------+----------------------------+-------+----------+
| Schema | Name | Type | Owner |
|--------+----------------------------+-------+----------|
| public | Clean | table | superset |
| public | FCC 2018 Survey | table | superset |
| public | ab_permission | table | superset |
| public | ab_permission_view | table | superset |
| public | ab_permission_view_role | table | superset |
| public | ab_register_user | table | superset |
| public | ab_role | table | superset |
| public | ab_user | table | superset |
| public | ab_user_role | table | superset |
| public | ab_view_menu | table | superset |
| public | access_request | table | superset |
| public | alembic_version | table | superset |
| public | alert_logs | table | superset |
| public | alert_owner | table | superset |
| public | alerts | table | superset |
| public | annotation | table | superset |
| public | annotation_layer | table | superset |
| public | bart_lines | table | superset |
| public | birth_france_by_region | table | superset |
| public | birth_names | table | superset |
| public | cache_keys | table | superset |
| public | channel_members | table | superset |
| public | channels | table | superset |
| public | cleaned_sales_data | table | superset |
| public | clusters | table | superset |
| public | columns | table | superset |
| public | covid_vaccines | table | superset |
With an understanding of dialects, drivers, and databases under our belt, let’s solidify it with a few examples. Assume we want to create a Superset database for a PostgreSQL data source and a particular PostgreSQL database named mydatabase. Our PostgreSQL data source is hosted at pghost on port 5432 and we will log in as sonny (password is foobar). Here are three SQLAlchemy URIs we could use (inspired by the SQLAlchemy documentation):
postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase We explicitly specify the postgresql dialect and psycopg2 driver.
postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase We use the pg8000 driver.
postgresql://sonny:foobar@pghost:5432/mydatabase We do not explicitly list any driver, and hence, SQLAlchemy will use the default driver, which is psycopg2 for postgresql.
Because Presto can connect to multiple data sources, when connecting to Presto as a defined Superset database, it’s important to understand what you are actually making a connection to.
In Presto, the equivalent notion of a “database” (i.e. a logical collection of tables) is called a schema. Access to a specific schema (“database”) in a data source is defined in a catalog.
As an example, the listing below is the equivalent catalog configuration to connect to the example mydatabase PostgreSQL database we described previously. If we were querying a table in that catalog directly from Presto, a fully-qualified table would be specified as catalog.schema.table (e.g. select * from catalog.schema.table). Hence, querying the Clean table would be select * from postgresql.mydatabase.Clean.
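For reference, a sketch of such a catalog file (e.g. etc/catalog/postgresql.properties), assuming the Presto PostgreSQL connector and the credentials from the earlier example:

# illustrative catalog configuration; host, database, and credentials are placeholders
connector.name=postgresql
connection-url=jdbc:postgresql://pghost:5432/mydatabase
connection-user=sonny
connection-password=foobar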
Going back to Superset, to create a Superset database that connects to Presto, we specify the Presto dialect. However, because Presto is the intermediary to an underlying data source, such as PostgreSQL, the username and password we need to provide (and authenticate against) are the Presto username and password. Further, we must specify a Presto catalog for the database in the SQLAlchemy URI. From there, Presto (through its catalog configuration) authenticates to the backing data source with the appropriate credentials (e.g. sonny and foobar). Hence, the SQLAlchemy URI to connect to Presto in Superset is as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>
The http-server-port refers to the http-server.http.port configuration on the coordinator and workers (see Presto config properties); it is usually set to 8080.
New Superset Database Connection UI
In Superset 1.3, there is a feature-flagged version of a new database connection UI that simplifies connecting to data without constructing the SQLAlchemy URI. The new database connection UI can be turned on in config.py with FORCE_DATABASE_CONNECTIONS_SSL = True (PR #14934). The new UI can also be viewed in the Superset documentation.
Try It Out!
In less than 30 minutes, you can get up and running using Superset with a Presto cluster with Ahana Cloud for Presto. Ahana Cloud for Presto is an easy-to-use fully managed Presto service that also automatically stands up a Superset instance for you. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.
Presto Tutorial 103: PrestoDB cluster on GCP
Introduction
This tutorial is Part III of our Getting started with PrestoDB series. As a reminder, Prestodb is an open source distributed SQL query engine. In tutorial 102 we covered how to run a three node prestodb cluster on a laptop. In this tutorial, we’ll show you how to run a prestodb cluster in a GCP environment using VM instances and GKE containers.
Environment
This guide was developed on GCP VM instances and GKE containers.
Presto on GCP with VMs
Implementation steps for prestodb on vm instances
Step 1: Create a GCP VM instance using the CREATE INSTANCE tab and name it presto-coordinator. Next, create three more VM instances named presto-worker1, presto-worker2, and presto-worker3.
Step 2: By default GCP blocks all network ports, so prestodb will need ports 8080-8083 enabled. Use the firewall rules tab to enable them.
Step 3:
Install Java and Python.
Step 4:
Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.235.1, which we will call the installation directory.
Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io
user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 721M 100 721M 0 0 245M 0 0:00:02 0:00:02 --:--:-- 245M
user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.7M 100 12.7M 0 0 15.2M 0 --:--:-- --:--:-- --:--:-- 15.1M
user@presto-coordinator-1:~$
Step 5:
Use gunzip and tar to unzip and untar the presto-server
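For example, matching the tarball downloaded above:

tar -xzvf presto-server-0.235.1.tar.gz   # creates the presto-server-0.235.1 installation directory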
Check the cluster UI status. It should show the Active worker count at 0 since we enabled only the coordinator.
Step 10:
Repeat steps 1 to 8 on the remaining 3 vm instances which will act as worker nodes.
On the configuration step for worker nodes, set coordinator to false and http-server.http.port to 8081, 8082 and 8083 for worker1, worker2 and worker3 respectively.
Also make sure node.id and http-server.http.port are different for each worker node.
Edit the etc/config.properties file inside the container and set the node-scheduler.include-coordinator property to false. Now restart the coordinator.
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG      IMAGE ID       CREATED          SIZE
<none>                     <none>   46ab5129fe8a   15 seconds ago   1.81GB
ahanaio/prestodb-sandbox   latest   76919cf0f33a   34 hours ago     1.81GB
user @cloudshell:~ (weighty-list-324021)$ docker tag 46ab5129fe8a coordinator
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG      IMAGE ID       CREATED              SIZE
coordinator                latest   46ab5129fe8a   About a minute ago   1.81GB
ahanaio/prestodb-sandbox   latest   76919cf0f33a   34 hours ago         1.81GB
Step 9:
Tag the image with the Artifact Registry path and push it to the artifacts location.
user@cloudshell:~ docker tag coordinator:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1
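The push to that location would then look something like the following (the repository path is the one used in the tag step above):

docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1   # push the tagged coordinator image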
user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)          AGE
kubernetes           ClusterIP      10.7.240.1    <none>          443/TCP          41m
presto-coordinator   LoadBalancer   10.7.248.10   35.239.88.127   8080:30096/TCP   92s
Step 11:
Copy the external IP on a browser and check the status
Step 12:
Now to deploy worker1 into the GKE container, again start a local instance named worker1 using the docker run command.
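A sketch of that docker run command, assuming the ahanaio/prestodb-sandbox image is used for the worker:

docker run -d -p 8080:8080 -it --name worker1 ahanaio/prestodb-sandbox   # container name must be worker1; port mapping is illustrative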
Edit config.properties inside the worker1 container to set coordinator to false and http-server.http.port to 8081. Also, discovery.uri should point to the coordinator running inside the GKE container.
user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker1 bash
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://presto-coordinator01:8080
Step 14:
Stop the local worker1 container, commit it as an image, and tag it as the worker1 image.
user@cloudshell:~ (weighty-list-324021)$ docker stop worker1
worker1
user@cloudshell:~ (weighty-list-324021)$ docker commit worker1
sha256:cf62091eb03702af9bc05860dc2c58644fce49ceb6a929eb6c558cfe3e7d9abf
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY   TAG      IMAGE ID       CREATED         SIZE
<none>       <none>   cf62091eb037   6 seconds ago   1.81GB
user@cloudshell:~ (weighty-list-324021)$ docker tag cf62091eb037 worker1:latest
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY   TAG      IMAGE ID       CREATED         SIZE
worker1      latest   cf62091eb037   2 minutes ago   1.81GB
Step 15:
Push the worker1 image to the Google Artifact Registry location.
user@cloudshell:~ (weighty-list-324021)$ docker tag worker1:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
Check presto UI for successful deployment of worker1
Step 18:
Repeat steps 12 to steps 17 to deploy worker2 inside GKE container:
Deploy a local Ahana sandbox instance using docker and name it worker2.
Edit etc/config.properties inside the worker2 container to set coordinator to false, the port to 8082, and discovery.uri to the coordinator container name.
Shut down the instance, then commit it and create a docker image named worker2.
Push the worker2 image to the Google Artifact Registry location.
Use kubectl commands to deploy and expose the worker2 instance inside a Google container. Now check the prestodb UI for the second worker being active.
Check prestodb UI for successful deployment of worker2
user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker2 worker1
32ace8d22688901c9fa7b406fe94dc409eaf3abfd97229ab3df69ffaac00185d
user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker2 bash
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8082
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker commit worker2
sha256:08c0322959537c74f91a6ccbdf78d0876f66df21872ff7b82217693dc3d4ca1e
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY   TAG      IMAGE ID       CREATED          SIZE
<none>       <none>   08c032295953   11 seconds ago   1.81GB
user@cloudshell:~ (weighty-list-324021)$ docker tag 08c032295953 worker2:latest
Verify the prestodb environment by running the prestodb CLI with a simple JMX query.
user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                 Table
--------------------------------------------------------------------------
 com.facebook.airlift.discovery.client:name=announcer
 com.facebook.airlift.discovery.client:name=serviceinventory
 com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore
 com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore
 com.facebook.airlift.discovery.store:name=dynamic,type=replicator
Summary
In this tutorial you learned how to provision and run prestodb inside Google VM instances and on GKE containers. Now you should be able to validate the functional aspects of prestodb.
If you want to run production Presto workloads at scale and performance, check out https://www.ahana.io which provides a managed service for Presto.
Announcing the workload profile feature in Ahana Cloud
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service that simplifies Presto for cloud and data platform teams. With the managed Presto service, we provide a lot of tuned configurations out of the box for Ahana customers.
We’re excited to announce that the workload profile feature is now available on Ahana Cloud. With this release, users can create a cluster with a validated set of configurations that suits the type of workloads or queries users plan to run.
Today, Presto clusters are configured with default properties that work well for generic workloads. However, achieving workload-specific tuning and resource allocation requires a good understanding of Presto's resource consumption for that workload. Further, to change or add any property, you first have to update the configuration file and then restart the cluster. As a result, data platform users spend days, and in many cases weeks, iterating, evaluating, and experimenting with config tuning to reach the ideal configuration for their specific workloads. To solve this pain point and deliver predictable performance at scale, Ahana Cloud lets users select a tuned set of properties for their desired workload with a single click.
Here is the short demo of creating Presto Cluster with a workload profile:
Concurrent queries are simply the number of queries executing at the same time in a given cluster. To simplify this experience we have classified workloads based on the number of concurrent queries and curated a set of tuned session properties for each profile.
Low concurrency is useful for clusters that run a limited number of queries or a few large, complex queries. It also supports bigger and heavier ETL jobs.
High concurrency is better for running multiple queries at the same time. For example, dashboard and reporting queries or A/B testing analytics, etc.
This setting can be changed after the cluster has been created, and a cluster restart is not required. However, the change will only apply to new queries. Below is a short demo of how you can change these profiles on running clusters.
This feature is the beginning of auto-tune capabilities for workload management. We are continuously innovating Ahana Cloud for our customers and to deliver a seamless, easy experience for data teams looking to leverage the power of Presto. Please give it a try and log in to the Ahana Cloud console to get started. We have a free trial as well that you can sign up for today.
Announcing the Ahana $20M Series A – Furthering our Vision of Open Data Lake Analytics with Presto
I’m very excited to announce that Ahana, the SaaS for Presto company, has raised a jumbo $20M Series A round from lead investor Third Point Ventures. Our SaaS managed service for open source Presto enables data platform teams to easily and cost effectively deliver powerful SQL processing on all their data. This is our vision of Open Data Lake Analytics and it’s what Facebook, Uber, Twitter, and others in the Presto Foundation have been running at scale.
It’s been only 15 months since the founding of Ahana and I’m extremely proud of what the team has achieved to date. We came out of stealth last year with seed investments totalling $4.8M from GV (Google Ventures), Lux Capital, and Leslie Ventures, who all also participated in this oversubscribed A round. Initially we focused on the Presto community, providing support, open source contributions, and tutorials to help people get started easily with Presto. Subsequently, we announced the first managed service for Presto at PrestoCon in September and GA’d two months later at AWS Re:Invent in December with early customer acclaim.
I am also excited to have Rob Schwartz, Managing Partner of Third Point Ventures join our board and to partner with Third Point that has tremendous public market investing expertise, relationships, and research capabilities. Third Point has over $25B under management and Rob drives their early stage, growth, and cross-over investments in their emerging technology arm, Third Point Ventures. I value his hands-on experience with young companies, helping them deliver on their vision. For example, he is an active investor in Yellowbrick in the data space. Rob and Third Point had four key reasons to invest in Ahana…the team, the project, the market, and the product-market fit as evidenced by customer traction. Let me touch on these four areas:
THE TEAM: This is my 6th startup tour of duty and I’ll be the first to attest that startups require an incredible amount of energy. That energy comes from the synchronized rowing of the oars to propel the ship, no matter the size (or startup stage of company growth). I’m most proud of our extraordinary team and all the hands on deck that have pulled together so far. The Ahana team includes experts in a range of industry-leading databases: Teradata, IBM DB2, Vertica, Aster Data, and the recently IPO’d Couchbase, in addition to experts in open source and Presto hailing from Facebook, Uber, Walmart, and Alibaba.
THE OPEN SOURCE PRESTO PROJECT: Since we started Ahana last year, the momentum we’ve seen in the Presto community has been phenomenal. We helped the Presto Foundation lead two massively successful PrestoCons and numerous meetups across the world, whose membership has crossed 1,000 members. The Docker sandbox container for Presto hosted by Ahana has had over 250K pulls in just the last 6 months, and 10 companies are now part of the growing Presto Foundation consortium. We’re both thankful to all those in the community who have helped make Presto what it is and humbled to be a part of its success. We pledge to continue open source code contributions for the benefit of the Presto community for many years to come.
THE MARKET: This $20MM series A raise enables us to further our vision of providing the most flexible and performant Open Data Lake Analytics with Presto. Open Data Lake Analytics is quickly becoming the next big emerging stack, led by hyperscaler bellwether companies like Facebook and Uber. While the data warehouse has been the workhorse of analytics, it’s also very expensive. To address that, more data is moving into cloud data lakes like AWS S3 because they are so inexpensive and ubiquitous. So we’re seeing the data warehouse getting deconstructed in the cloud with the commodified data lake coupled with open source Presto. This stack enables SQL analytics directly on the data lake, making it the most cost effective, open, and flexible solution.
PRODUCT-MARKET FIT: Ahana is at the intersection of some rapidly growing trends right now – Open Source, Cloud Analytics, and the Data Lake. And our customer base is proof of that. Having only GA’d this past December, the adoption we’ve seen has been incredible. Companies like Cartona, an eCommerce company out of Egypt, and Carbon, a fast growing ad tech company, are using Ahana and telling the world about their use cases. Securonix, the Gartner magic quadrant leader in the security SIEM space, is a huge proponent of Presto and Ahana and recently joined us at the AWS Startup Showcase. Our customers are building out reporting and dashboarding, customer-facing analytics, data transformation use cases, and much more with Ahana Cloud for Presto. We can’t wait to see what they do next.
Lastly, I’ll mention that this raise enables us to accelerate growth in three main areas:
Technical innovation of the Presto project, by scaling our product and engineering teams. Btw, we’re an all remote company.
Adoption via more evangelism for the Presto open source project. We will continue working closely with the community and other Presto Foundation members like Facebook, Uber, and Intel.
Growing our Marketing and Sales organizations, continuing our focus on customer adoption.
To sum it up, I’d like to share what Rob Schwartz, Managing Partner of Third Point Ventures says about us:
“We’re excited to join the exceptional team at Ahana and assist them in their evolution from rapid, early adoption to substantial market prominence. As we witness the evolution of modern analytics, we’re seeing a new stack emerge adjacent to the data warehouse. Companies need an open, flexible approach to access their data, and the data lake with Presto on top provides that. Ahana Cloud provides the foundation for this new stack, giving all platform teams the ability to easily use Presto on their data lake. With Ahana, any company can leverage the open data lake for their analytics. This is extremely powerful.”
Cheers to our next phase of growth, and did I mention we’re hiring? 😉
We are just getting started. Join us on this incredible journey at the intersection of cloud, data, and open source…what many unicorns are made of.
Autoscale your Presto cluster in Ahana Cloud
Presto Cluster Autoscaling
Tip: if you are looking for a more thorough understanding of using PrestoDB check out the free technical ebook, Learning and Operating Presto.
We are beyond excited to announce that autoscaling is now available on Ahana Cloud to help you maintain your Presto clusters. In this initial release, the autoscaling feature monitors the average CPU utilization of your PrestoDB worker nodes and scales out when it reaches the 75% threshold. In addition, Presto clusters can now scale in to a minimum number of worker nodes when the cluster is idle for a user-specified amount of time.
Never run out of memory with autoscaling
One of the challenges of running a Presto cluster is making the right decision about the number of worker nodes required to run your queries. Not all queries are equal, and predicting how many nodes will be required is not always possible. With the scale-out feature, the number of worker nodes increases based on CPU utilization to ensure that your queries can execute without running out of memory. That way you don’t have to worry about whether your deployment can support your requirements. Future iterations will include scale-in based on CPU utilization and autoscaling based on additional metrics.
Save cost with Idle time
When no queries are sent to a Presto cluster, it makes sense to reduce the number of worker nodes, but it’s not always practical to do so manually. With the idle time feature enabled, the system monitors query activity; if no activity is detected for a user-defined period of time, say 15 minutes, the number of worker nodes is reduced to its minimum count.
Two of the most common use cases we have found that benefit greatly from idle time cost saving are transformation workloads and ad hoc querying.
For transformation workloads, a query can potentially run for several hours, making it impractical to monitor its activity to decide when to manually stop the cluster or reduce the number of running nodes. Idle time cost saving waits for a certain period of inactivity and then automatically reduces the worker nodes to a minimum until the next query hits the cluster.
For ad hoc querying, as the name suggests, the querying is not continuous, and scaling in to a minimum worker node count between queries helps reduce costs.
Enabling autoscaling with a Presto Cluster
Getting started with autoscaling a Presto cluster is easy. We’ll demonstrate just how simple it is with this step-by-step walkthrough.
Step 1 – First, in your Cluster settings select Scale Out only (CPU) scaling strategy
Step 2 – Enter a Minimum and a Maximum worker node count as well as a Scale Out step size. The scale-out step size will decide how many nodes get added to the cluster when the scaling out triggers
Step 3 – By default, the cluster will resize to the minimum worker node count defined above after 30 minutes; this can be set between 10 minutes and 1 hour
Your new Presto cluster will scale out up to its maximum worker node count as long as the average CPU utilization of the worker nodes goes beyond 75%. However, if no queries reach the cluster for the default period of 30 minutes, the cluster will reduce its worker node count to its minimum.
Enabling Idle time cost saving
Enabling Idle time cost saving is very easy with this step-by-step walkthrough. As shown in the section above, idle time cost saving is enabled by default in the Scale Out only (CPU) scaling strategy.
For the Static cluster, to enable the feature, you will need to do the following:
Step 1 – Check Scale to a single worker node when idle
Step 2 – By default, the cluster will resize to a single worker node after 30mins, this can be set between 10 mins and 1 hour.
Changing the autoscaling configuration of an existing cluster
You can always change the configuration after a cluster got created by following the steps below:
Step 1 – Navigate to the cluster details view
Step 2 – Edit the cluster scaling policy configuration
Step 3 – The server will update its configuration immediately after clicking the Save button
What’s next?
Ready to take the next step with Presto to see this in action? Log in to the Ahana Cloud console to get started. You can also learn more about autoscaling by heading over to our documentation.
Amazon Redshift is a cloud data warehouse, permitting the execution of SQL queries, offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.
At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2. Users can easily run SQL queries on Redshift, but there are some limitations.
Are you using Presto and want more information to simplify your work processes? Start here. Learn how to connect Superset to Presto. This article will break down the steps needed to use Presto with Superset and get you moving faster.
Snowflake may not be the silver bullet you wanted for your long term data strategy… here’s why
Since COVID, every business has pivoted to move everything online, accelerating digital transformation with data and AI. Self-service, accelerated analytics has become more and more critical for businesses, and Snowflake did a great job bringing the cloud data warehouse to market when users were struggling with on-prem big data solutions and trying to catch up on their cloud journey. Snowflake is designed foundationally to take advantage of the cloud’s benefits, and while it has benefited from a first-mover advantage, here are the key areas you should think about as you evaluate a cloud data warehouse like Snowflake.
Open Source; Vendor Lock-in
Using a SQL engine that is open source is strategically important because it allows data to be queried without the need to ingest it into a proprietary system. Snowflake is not open source software. Only data that has been aggregated and moved into Snowflake in a proprietary format is available to its users. Moreover, Snowflake is pushing back on open source in favor of its proprietary solutions. Recently, Snowflake announced the Snowflake Data Cloud, positioning Snowflake as a platform for “cloud data” where organizations can move and store all their data.
However, surrendering all your data to the Snowflake data cloud model creates vendor lock-in challenges:
Excessive cost as you grow your data warehouse
Data ingested into another system is typically locked into the formats of that closed-source system
No community innovations or way to leverage other innovative technologies and services to process that same data
Snowflake doesn’t benefit from community innovation that true open source projects benefit from. For example, an open source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Ahana and more. At Twitter, engineers are working on the Presto-Iceberg connector, aiming to bring high-performance data analytics on open table format to the Presto ecosystem.
Check this short session on an overview of how we are evolving Presto to be the next generation query engine at Facebook and beyond.
With a proprietary technology like Snowflake, you miss out on community-led contributions that can shape a technology for the best of everyone.
Open Format
Snowflake has chosen to use a micro-partition file format that may be good for performance but is closed source. The Snowflake engine cannot work directly with the most common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into Snowflake’s internal file format, but you miss out on the performance optimizations these open formats can bring to your engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge files.
On the other hand, Presto users can use TensorFlow on the same open formats, like Parquet and ORC, so there’s a lot of flexibility in this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.
While migrating from legacy data warehouse platforms to Snowflake may offer less friction for cloud adoption, trying to integrate open source formats into a single proprietary solution may not be as simple as sold.
A SQL engine is needed both for the data lake where raw data resides and for the broad range of other data sources, so that an organization can mix and match data from any source. If your data resides in relational databases, NoSQL databases, cloud storage, file systems like HDFS, etc., then Snowflake is not suitable for your self-service data lake strategy. You cannot run SQL queries across data stored in relational, non-relational, object, and custom data sources using Snowflake.
Workload Flexibility
Today users want to create new applications at the same rate as their data is growing and a single database is not a solution to support a broad range of analytical use cases. One common workload is training and using machine learning models right over warehouse tables or streaming analytics. Snowflake focuses on a traditional data warehouse as a managed service on the cloud and requires proprietary connectors to address these ML/DS workloads, which brings up data lineage challenges.
If you have a lot of unstructured data like text or images, the volume is beyond petabytes, or schema-on-read is a must-have feature, then Snowflake does not fit into your data lake architecture.
The new generation of open platforms that unify the data warehouse and advanced analytics is something Snowflake is not fundamentally designed for; Snowflake is only suitable for data warehouse use cases.
Data Ownership
Snowflake did decouple storage and compute. However, Snowflake does not decouple data ownership: it still owns the compute layer as well as the storage layer. This means users must ingest data into Snowflake in a proprietary format, creating yet another copy of the data and requiring users to move their data out of their own environment. Users lose ownership of their data.
Cost
Users think of Snowflake as an easy, low-cost model. However, it gets very expensive and cost-prohibitive to ingest data into Snowflake. Very large data volumes and enterprise-grade, long-running queries can result in significant Snowflake costs.
Because Snowflake is not fully decoupled, data is copied and stored in Snowflake’s managed cloud storage layer within Snowflake’s account. Hence, users end up paying Snowflake more than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price under a proprietary tag.
Conclusion
Snowflake may sound appealing in how simple it is to implement a data warehouse in the cloud. However, an open data lake analytics strategy will augment the data warehouse system in places the warehouse may fall short as discussed above, providing significant long-term strategic benefits to users.
With PrestoDB as the SQL engine for Open Data Lake Analytics, you can execute SQL queries with high performance, similar to an EDW. Using Presto for your data lake analytics means you don’t have to worry about vendor lock-in, and you get the benefits of open source goodness like RaptorX, Project Aria, Apache Ranger integration, etc. Check out this short tutorial on how to query a data lake with Presto.
While powerful, Presto can be complex and resource-intensive when it comes to managing and deploying. That’s where Ahana comes in. Ahana Cloud is the easiest managed service for PrestoDB in the cloud. We simplify open source operational challenges and support top innovations in the Presto community.
As Open Data Lake Analytics evolves, we have a great and advanced roadmap ahead. You can get started with Ahana Cloud today.
Presto 102 Tutorial: Install PrestoDB on a Laptop or PC
Summary
PrestoDB is an open source, distributed, parallel SQL query engine. In Tutorial 101 we walked through manual installation and configuration on a bare-metal server or a VM. It is very common to try PrestoDB on a laptop for quick validation, and this guide, Tutorial 102, walks through simple steps to install a three-node PrestoDB cluster on a laptop.
Environment
This guide was developed using a laptop running Windows with Docker installed.
Steps for Implementing Presto
Step 1:
Create a Docker network namespace so that the containers can communicate with each other over it.
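A minimal sketch of that command (the network name presto_network is just an illustrative choice):
docker network create presto_network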
Ahana has developed a sandbox for PrestoDB that can be downloaded from Docker Hub. Use the commands below to download the PrestoDB sandbox, which comes with all the packages needed to run PrestoDB.
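A sketch of the pull and of running the first (coordinator) container, assuming the ahanaio/prestodb-sandbox image name referenced later in this guide; the container name, network, and port mapping are illustrative:
docker pull ahanaio/prestodb-sandbox
docker run -d -p 8080:8080 --name coordinator --network presto_network ahanaio/prestodb-sandbox
With the port mapped this way, the Presto UI should be reachable at http://localhost:8080.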
Check the cluster status in the Presto UI; by default the Ahana PrestoDB sandbox comes with one coordinator and one worker.
If only the coordinator needs to run, without a worker node, edit the config.properties file and set node-scheduler.include-coordinator to false.
Since the sandbox image ships with a coordinator, it needs to be disabled on the second instance so that it runs as a worker node. To do that, click on the terminal window in the Docker container/apps UI and edit etc/config.properties: set coordinator to false and set the HTTP port to be different from the coordinator’s.
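A sketch of this step under the same assumptions as above; the container name worker1 and port 8081 are illustrative:
docker run -d -p 8081:8081 --name worker1 --network presto_network ahanaio/prestodb-sandbox
Then, in the worker1 terminal, edit etc/config.properties so that it contains roughly:
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080
Here discovery.uri points at the coordinator container by name, which works because both containers share the Docker network created earlier.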
Restart the worker1 container and check the PrestoDB UI; the active worker count will now be 1 if the coordinator runs without a worker node, or 2 if the coordinator also runs a worker.
Step 8:
Repeat steps 5 to 7 to add a third worker node.
Start a new instance of the ahanaio/prestodb-sandbox image.
Disable the coordinator, set the HTTP port to be different from the coordinator’s, and set the discovery URI to the container name of the coordinator, as sketched below.
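A sketch for the third node, again with illustrative names and ports:
docker run -d -p 8082:8082 --name worker2 --network presto_network ahanaio/prestodb-sandbox
Inside worker2, edit etc/config.properties so it contains roughly:
coordinator=false
http-server.http.port=8082
discovery.uri=http://coordinator:8080
docker restart worker2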
PrestoDB cluster installation is much simpler with the Ahana PrestoDB sandbox. The cluster is now ready for any functional validation.
Enabling spill to disk for optimal price per performance
Presto was born out of the need for low-latency interactive queries on large scale data, and hence, continually optimized for that use case. In such scenarios, the best practice is to properly size Presto Worker nodes such that all the aggregate cluster memory can fit all the data required for target data sources, queries, and concurrency level. In addition, to ensure fairness of memory across queries and prevent deadlocks, by default, Presto will kill queries that exceed configured memory limits.
The Case for Spill to Disk
As Presto usage and adoption continue to grow, it is being used for more and more use cases. For some of them, full memory bandwidth and low latency are not necessary. For example, consider long-running queries on large historical data, such as logs, where low-latency results are not paramount. In these cases, it may be acceptable, and even more optimal overall, to trade some performance for cost savings. One way to achieve this is, of course, to use lower-memory Presto Workers. However, perhaps these longer batch workloads where higher latency is tolerable are not the norm, but the minority case. Enter Presto’s spill-to-disk functionality, where Presto can be configured to spill intermediate data from memory to disk when needed. While queries that spill to disk have longer execution times than their entirely in-memory equivalents, they will not fail due to exceeding configured memory properties.
Cost Savings of Spill to Disk
Let’s walk through a practical example of a spill-to-disk scenario. A 15-Worker Presto cluster of r5.2xlarge instances (64 GB memory, 8 vCPU, roughly $0.50 per hour) in AWS costs about $180 per day, with aggregate cluster memory of close to 1 TB (960 GB actual). If instead we ran a cluster with roughly 30% fewer Presto Workers, at 10 nodes, we would decrease the aggregate cluster memory by 320 GB. But let’s say we augment the remaining 10 nodes with 1 TB of disk storage in total (more than 3x the 320 GB of memory removed, at 100 GB per node) to leverage Presto disk spilling. At $0.10 per GB-month for gp2 EBS volumes, that storage costs only about $100 per month, a small fraction of the compute cost saved by removing those five nodes, even with the 3x factor.
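To make the arithmetic explicit (all figures approximate, based on the on-demand prices above):
15 Workers x $0.50/hr x 24 hr ≈ $180 per day, or roughly $5,400 per month
Removing 5 Workers saves about 5 x $0.50/hr x 24 hr x 30 days ≈ $1,800 per month
Adding 1 TB (1,000 GB) of gp2 EBS at $0.10 per GB-month ≈ $100 per month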
Spill to Disk Configuration
There are several configuration properties that need to be set to use spill to disk and they are documented in the Presto documentation. Here is an example configuration with 50 GB of storage allocated to each Worker for spilling.
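A sketch of what that could look like in the Worker configuration; the spill path and the individual limits are illustrative and should be tuned to your environment (experimental.spill-enabled turns the feature on):
experimental.spill-enabled=true
experimental.spiller-spill-path=/mnt/presto/spill
experimental.spiller-max-used-space-threshold=0.9
experimental.max-spill-per-node=50GB
experimental.query-max-spill-per-node=50GB
experimental.max-revocable-memory-per-node=16GB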
experimental.spiller-spill-path: Directory where spilled content will be written.
experimental.spiller-max-used-space-threshold: If disk space usage ratio of a given spill path is above this threshold, the spill path will not be eligible for spilling.
experimental.max-spill-per-node: Max spill space to be used by all queries on a single node.
experimental.query-max-spill-per-node: Max spill space to be used by a single query on a single node.
experimental.max-revocable-memory-per-node: How much revocable memory any one query is allowed to use.
Conclusion
Several large scale Presto deployments take advantage of spill to disk, including Facebook. Today, the Ahana Cloud managed Presto service enables spill to disk by default and sets the per node spill to disk limit at 50 GB. We will be releasing the ability for customers to configure and tune their per node spill-to-disk size soon. Give it a try. You can sign up and start using our service today for free.
Presto can be installed manually or using Docker images on:
Single Node: both the coordinator and workers run on the same machine.
Multiple Nodes: the coordinator and workers run on separate machines, depending on the workload requirements.
Installing Presto Manually
Download the Presto server tarball, presto-server-0.253.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.253.1, which we will call the installation directory.
Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io.
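A sketch of those commands, assuming the release artifacts are pulled from the Maven Central locations that prestodb.io links to, and using the 0.253.1 version referenced above (adjust to the release you want):
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.253.1/presto-server-0.253.1.tar.gz
tar -xzvf presto-server-0.253.1.tar.gz
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.253.1/presto-cli-0.253.1-executable.jar -O presto
chmod +x presto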
Presto needs a data directory for storing logs, etc. We recommend creating a data directory outside of the installation directory, which allows it to be easily preserved when upgrading Presto.
[root@prestodb_c01 ~]# mkdir -p /var/presto/data
Configuration Settings
Create an etc directory inside the installation directory. This will hold the following configuration:
Node Properties: environmental configuration specific to each node
JVM Config: command-line options for the Java Virtual Machine
Config Properties: configuration for the Presto server
Catalog Properties: configuration for Connectors (data sources)
[root@prestodb_c01 ~]# mkdir etc
Node Properties
The node properties file, etc/node.properties contains configuration specific to each node. A node is a single installed instance of Presto on a machine. This file is typically created by the deployment system when Presto is first installed. The following is a minimal etc/node.properties:
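A minimal sketch, in line with the data directory created earlier; the environment name and node ID are placeholder values to replace per node:
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data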
node.environment: The name of the environment. All Presto nodes in a cluster must have the same environment name.
node.id: The unique identifier for this installation of Presto. This must be unique for every node. This identifier should remain consistent across reboots or upgrades of Presto. If running multiple installations of Presto on a single machine (i.e. multiple nodes on the same machine), each installation must have a unique identifier.
node.data-dir: The location (filesystem path) of the data directory. Presto will store logs and other data here.
JVM configuration
The JVM config file, etc/jvm.config, contains a list of command-line options used for launching the Java Virtual Machine. The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.
The following provides a good starting point for creating etc/jvm.config:
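One commonly used starting point is sketched below; the heap size (-Xmx16G) is an assumption and should be sized to the memory available on your machine:
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError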
Because an OutOfMemoryError will typically leave the JVM in an inconsistent state, we write a heap dump (for debugging) and forcibly terminate the process when this occurs.
Config Properties
The config properties file, etc/config.properties, contains the configuration for the Presto server. Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters.
To set up a single machine for testing that functions as both a coordinator and a worker, set the following parameters in etc/config.properties:
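A sketch of such a config, matching the property descriptions that follow; the port, memory limits, and discovery URI are illustrative values to adjust for your machine:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080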
coordinator: Allow this Presto instance to function as a coordinator (accept queries from clients and manage query execution).
node-scheduler.include-coordinator: Allow scheduling work on the coordinator.
http-server.http.port: Specifies the port for the HTTP server. Presto uses HTTP for all communication, internal and external.
query.max-memory: The maximum amount of distributed memory that a query may use.
query.max-memory-per-node: The maximum amount of user memory that a query may use on any one machine.
query.max-total-memory-per-node: The maximum amount of user and system memory that a query may use on any one machine, where system memory is the memory used during execution by readers, writers, and network buffers, etc.
discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.
discovery.uri: The URI to the Discovery server. Because we have enabled the embedded version of Discovery in the Presto coordinator, this should be the URI of the Presto coordinator. Replace example.net:8080 to match the host and port of the Presto coordinator. This URI must not end in a slash.
You may also wish to set the following properties:
jmx.rmiregistry.port: Specifies the port for the JMX RMI registry. JMX clients should connect to this port.
jmx.rmiserver.port: Specifies the port for the JMX RMI server. Presto exports many metrics that are useful for monitoring via JMX.
Log Levels
The optional log levels file, etc/log.properties allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the class that uses the logger.
There are four levels: DEBUG, INFO, WARN and ERROR.
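For example, a one-line etc/log.properties that sets the minimum level for Presto’s own classes (the level shown is just an illustration):
com.facebook.presto=INFO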
Catalog Properties
Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog.
Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, create etc/catalog/jmx.properties with the following contents to mount the jmx connector as the jmx catalog:
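The contents of that file are a single line naming the connector:
connector.name=jmx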
The installation directory contains the launcher script in bin/launcher. Presto can be started as a daemon by running the following:
[root@hsrhvm01 presto-server-0.235.1]# bin/launcher start
Started as 23378
After launching, you can find the log files in var/log:
launcher.log: This log is created by the launcher and is connected to the stdout and stderr streams of the server. It will contain a few log messages that occur while the server logging is being initialized and any errors or diagnostics produced by the JVM.
server.log: This is the main log file used by Presto. It will typically contain the relevant information if the server fails during initialization. It is automatically rotated and compressed.
http-request.log: This is the HTTP request log which contains every HTTP request received by the server. It is automatically rotated and compressed.
Why I’m betting on PrestoDB, and why you should too!
I’ve been in open source software companies and communities for over 10 years now, and in the database industry my whole career. So I’ve seen my fair share of the good, the bad, and the ugly when it comes to open source projects and the communities that surround them. And I’d like to think that all this experience has led me to where I am today – cofounder of Ahana, a company that’s betting big on an open source project: PrestoDB. Let’s first talk about the problem we’re trying to solve.
The Big Shift
Organizations have been using costly, proprietary data warehouses as the workhorse for analytics for many years. And in the last few years, we’ve seen a shift to cloud data warehouses like Snowflake, AWS Redshift, and BigQuery.
Couple that with the fact that organizations have much more data (from terabytes to tens and hundreds of terabytes to even petabytes) and more types of data (telemetry, behavioral, IoT, and event data in addition to enterprise data), and there’s even greater urgency for users not to get locked into proprietary formats and proprietary systems.
These shifts, along with AWS commoditizing a storage layer that is ubiquitous and affordable, mean that a lot more data now lives in cloud object stores like AWS S3, Google Cloud Storage, and Azure Blob Storage.
So how do you query the data stored in these data lakes? How do you ask questions of the data and pull the answers into reports and dashboards?
The “SQL on S3” Problem
Data lakes are only storage – there is no intelligence in the data lake
If the data is structured, think in tables and columns, SQL is hands down the best way to query it – hey it’s survived 50+ years
If data is semi-structured (think nested formats like JSON), SQL can still be used to query it with extensions to the language
But SQL on data lakes was complicated. Hadoop tried it, and we know that didn’t work. While it sort of solved the problem, trying to get 70+ different components and projects to integrate and work together turned out to be a nightmare for data platform teams.
There was really no simple yet performant way of querying S3.
That is where Presto comes in.
Presto is the best engine built to directly query open formats and data lakes. Presto replaced Hive at Facebook. Presto is the heart of the modern Open Data Lake Analytics Stack – an analytics stack that includes open source, open formats, open interfaces, and open cloud. (You can read more details about this in my Dataversity article.)
The fact of the matter is that this problem can be solved with many different open source projects/engines. At Ahana, we chose Presto – the engine that runs at Facebook, Uber and Twitter.
Why PrestoDB?
1. Crazy good tech
Presto is in-memory, scalable, and built like a database. And with the new innovations being added, Presto is only getting bigger and better.
Like I mentioned earlier, I’ve been in open source and the database space for a long time and built multiple database engines – both structured and semi-structured (aka NoSQL). I believe that PrestoDB is the query engine most aligned with the direction the analytics market is headed. At Ahana, we’re betting on PrestoDB and have built a managed service around it that solves the problem for Open Data Lake Analytics – the SQL on S3 problem.
No other open source project has come close. Some that come to mind –
Apache Drill unfortunately lost its community. It had a great start being based on the Dremel paper (published by Google) but over time didn’t get the support it should have from the vendor behind it and the community fizzled.
SparkSQL (built on Apache Spark) isn’t built like a database engine but instead is bolted on top of a general-purpose computation engine.
Trino, a hard fork of Presto, is largely focused on a different problem of broad access across many different data sources versus the data lake. I fundamentally believe that all data sources are NOT equal and that data lakes will be the most important data source over time, overtaking the data warehouses. Data sources are not equal for a variety of reasons:
Amount of data stored
Type of information stored
Type of analysis that can be supported on it
Longevity of data stored
Cost of managing and processing the data
Given that 80-90% of the data lives on S3, I’ve seen that 80-90% of analytics will be run on this cheapest data source. That data source is the data lake. And the need to perform a correlation across more data sources comes up only 5-10% of the time for a window of time until the data from those data sources also gets ingested into the data lake.
As an example: MySQL is the workhorse for operational transactional systems. A complex analytical query with multi-way joins from a federated engine could bring down the operational system. And while there may be a small window of time when the data in MySQL is not available in the data lake, it will eventually be moved over to the lake, and that’s where the bulk of the analysis will happen.
2. Vendor-neutral Open Source, not Single-vendor Open Source
On top of being a great SQL query engine, Presto is open source and, most importantly, part of the Linux Foundation – governed with transparency and neutrality. On the other hand, Trino, the hard fork of Presto, is a single-vendor project, with most of the core contributors being employees of the vendor. This is problematic for any company planning to use a project as a core component of its data infrastructure, and even more so for a vendor like Ahana that needs to be able to support its customers and contribute to and enhance the source code.
Presto is hosted under Linux Foundation in the Presto Foundation, similar to how Kubernetes is hosted by Cloud Native Computing Foundation (CNCF) under the Linux Foundation umbrella. Per the bylaws of the Linux Foundation, Presto will always stay open source and vendor neutral. It was a very important consideration for us that we could count on the project remaining free and open forever, given so many examples where we have seen single-vendor projects changing their licenses to be more restrictive over time to meet the vendor’s commercialization needs.
For those of you that follow open source, you most likely saw the recent story on Elastic changing its open-source license which created quite a ripple in the community. Without going into all the details, it’s clear that this move prompted a backlash from a good part of the Elastic community and its contributors. This wasn’t the first “open source” company to do this (see MongoDB, Redis, etc.), nor will it be the last. I believe that over time, users will always pick a project driven by a true gold-standard foundation like Linux Foundation or Apache Software Foundation (ASF) over a company-controlled project. Kubernetes over time won the hearts of engineers over alternatives like Apache Mesos and Docker Swarm and now has one of the biggest, most vibrant user communities in the world. I see this happening with Presto.
3. Presto runs in production @ Facebook, runs in production @ Uber & runs in production @ Twitter.
The most innovative companies in the world run Presto in production for interactive SQL workloads. Not Apache Drill, not SparkSQL, not Apache Hive and not Trino. The data warehouse engine at Facebook is Presto. Ditto that for Uber, likewise for Twitter.
When a technology is used at the scale these giants run at, you know you are not only getting technology created by the brightest minds but also tested at internet scale.
The Conclusion
And that’s why I picked Presto, that’s PrestoDB and there’s only 1 Presto.
In summary, I believe that Presto is the de facto standard for SQL analytics on data lakes. It is the heart of the modern open analytics stack and will be the foundation of the next 10 years of open data lake analytics.
Join us on our mission to make PrestoDB the open source, de facto standard query engine for everyone.
5 main reasons Data Engineers move from AWS Athena to Ahana Cloud
In this brief post, we’ll discuss the 5 main reasons why data platform engineers decide to move their data analytics workloads from Amazon Athena to Ahana Cloud for Presto.
While AWS Athena’s serverless architecture means users don’t need to scale, provision, or manage any servers, there are trade-offs with a serverless approach around performance, pricing, and several technical limitations.
What are AWS Athena and Ahana Cloud for Presto?
Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases.
Ahana Cloud for Presto provides a fully managed Presto cloud service in AWS, with support for a wide range of native Presto connectors, IO caching, and configurations optimized for your workload.
AWS Athena is a serverless interactive query service built on Presto that developers use to query AWS S3-based data lakes and other data sources.
While there are some benefits to AWS Athena, let’s talk about why the data engineers we talk to migrate to Ahana Cloud.
1. Need for Concurrency & Partitions
AWS Athena’s maximum concurrency is limited to 20-25 queries depending on the region, and users must request quota increases to go beyond that. Some users even observe a maximum concurrency closer to 3. Athena users can only run up to 5 queries simultaneously per account, and Athena restricts each account to 100 databases. Athena’s partition limit is 20K partitions per table when using the Hive catalog. These limitations pose challenges when a complex query lands in front of more latency-sensitive workloads, like serving results to a user-facing dashboard.
Ahana Cloud, on the other hand, runs any number of queries whenever you need them. You have full transparency into what’s going on under the hood, and you get unlimited concurrency because you can simply scale the number of distributed workers.
2. Need for Performance predictability
When using AWS Athena you don’t control the number of underlying servers that AWS allocates to Athena to run your queries. As the Athena service is shared, the performance characteristics can change frequently and substantially. One minute there may be 50 servers, the next only 10 servers.
With Ahana Cloud, because you have full control of your deployment, performance is always consistent and often faster than Athena.
3. Need for more Data source connectors
AWS Athena doesn’t use native Presto connectors, so you’ll need to use the limited options AWS provides or build your own with the AWS Lambda service.
In Ahana Cloud, you define and manage data sources in the SaaS console, and you can attach or detach them from any cluster with the click of a button. Connect your existing Amazon database services like Amazon RDS for MySQL, Amazon RDS for PostgreSQL, Elasticsearch, and Amazon Redshift.
4. Need for control over the underlying engine
AWS Athena’s serverless nature may make it easy to use, but it also means users have no control over adding more sessions, resources, debugging, etc.
In Ahana Cloud however, you control the number of Presto nodes in your deployment, and you choose the node instance-types for optimum price/performance. That’s easy with the full visibility provided via dashboards on performance and query management.
5. Need for Price predictability
AWS Athena billing is per query, based on the volume of data scanned, which makes it inefficient and expensive at scale. Because costs are hard to control and predict, some users end up with “bill shock.” If one query scans one terabyte, that’s $5 for a few seconds of work.
Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, increased resilience, speed, and ease of use. Plus, Ahana has pay-as-you-go pricing – you only pay for what you use. Using the same example, $5 lets you run a 6-node cluster of r5.xlarge instances for an hour, or hundreds of queries instead of just one.
Summary
AWS Athena’s serverless architecture makes it really easy to get started with; however, the service has many limitations that can cause problems, and many data engineering teams have spent hours trying to diagnose them. Due to these limitations, AWS Athena can run slowly and increase operational costs.
Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists. And all this without the limits of AWS Athena.
Ahana Cloud is available in AWS. You can sign up and start using our service today for free.
Presto Foundation compared to Single Vendor-Driven Open Source Projects
Why Project Neutrality Matters
Steven Mih, Cofounder and CEO, Ahana
PrestoCon Day was last week on March 24, and it was an incredible event with lots of great speakers, great attendance, and positivity around the Presto community. I gave a lightning talk called the “Top Ten Reasons to Use and Contribute to Presto.”
In Letterman style, it’s a read out of each reason, starting from 10 and counting down to 1. Number 9 was:
Presto is neutrally governed with the stated principle that no one individual or company can control the project.
At this point in my presentation, some attendees spammed the chat thread with links claiming that Trino has neutrality, implying that Presto does not.
This blog will share data points on why this claim on neutrality is false.
First off, what is Neutrality?
Some say the word “neutrality” with a tone of righteousness, implying freedom or independence of some sort. But that isn’t what the word means. As a noun, “neutrality” simply means impartiality. That when there is a decision to be made there isn’t support for one side vs. another.
Presto Foundation is neutrally-governed
Here’s two data points that show Presto is neutrally-governed:
#1: Linux Foundation’s Presto Foundation has adopted these principles since Dec 2019:
(Slide from the PrestoCon Day opening presentation)
#2: The Presto Foundation Governing Board operates with these principles, with oversight from the Linux Foundation. They have grown and diversified Presto Foundation membership from 4 member companies to 10 member companies today. They have worked with the project’s Technical Steering Committee to grow representation from 2 companies to 5 companies today.
Presto is controlled by multiple end-user companies and vendor companies with a common set of principles and Linux Foundation oversight.
So Presto is a neutrally-governed OSS project.
Single vendor-driven OSS projects don’t have neutrality
Single vendor OSS projects are ones where the majority of committers work for one vendor company – see companies like MongoDB, Elastic, and Starburst. In this case, OSS projects are tied to vendors and their commercial agendas – they are not impartial, so they are not neutral.
With that in mind, here are two data points that show Trino is a single vendor-driven project:
#1) The Trino Software Foundation owners are Martin Traverso, Dain Sundstrom and David Phillips:
“The foundation will be led by the three original creators of Presto: Martin Traverso, Dain Sundstrom and David Phillips.” (source: press release)
#2) The Starburst cofounders and CTOs are Martin Traverso, Dain Sundstrom, and David Phillips.
Trino is in a foundation that is controlled by cofounders of one vendor company, so it is a single vendor-driven OSS project.
Therefore, Trino does not have neutrality.
Why Neutrally-Governed Open Source Matters to You
For a Linux Foundation open source project, neutrality principles give users and developers confidence that they will be treated without a commercial agenda. In practice, some examples are:
Breaking changes to the code base have to be proposed and agreed.
Your code contributions will be viewed purely on their technical merit.
Your ability to become a committer is not limited by a vendor agenda.
More community innovation at the core of the project, not just peripheral features.
The project’s open source license will not be changed.
I’ve been an employee in 3 different single vendor-driven OSS projects before and have seen the kinds of decisions that are made in those circumstances. The common pattern is what I call the “EE vs. CE” internal debate between sales and developers on how to balance the open sourcing of new features in the Community Edition (CE) while keeping the money-making features proprietary in the Enterprise Edition (EE). As the company grows there’s almost always a shift to the money-making features of the company.
Join the Presto community!
We believe in Presto and the community benefits of a neutral governance model.
The combination of stated principles and oversight has helped Presto flourish, instead of leaning towards the agenda of any one individual, set of individuals, or company. The Presto Foundation works impartially together as a community.
Upsolver and Ahana Announce Technology Partnership to Accelerate Adoption of Open Data Lake Analytics in the Cloud with Presto
Todd Odess, Head of Partnerships at Upsolver | Ali LeClerc, Head of Marketing at Ahana
Today we’re thrilled to announce our joint technology partnership. Upsolver, the first no-code, high-speed cloud compute layer for making data lakes analytics-ready, and Ahana, the managed service for Presto, offer an end-to-end cloud-native solution for open data lake analytics. This solution gives anyone the ability to prepare data for analysis and querying. Upsolver’s visual UI lets users easily land and automatically optimize data in AWS S3, and then query it with Ahana Cloud for Presto.
Customers get to value quickly and affordably, without having to write and maintain Spark-based data pipelines or manage complex query engines. Over the long-term, they benefit from greater analytics agility, being able to productionize new use cases in days rather than weeks or months.
Companies today collect and manage massive amounts of data – on AWS S3 alone there are 10,000+ data lakes. But studies show that companies only analyze 12% of their data; the rest is locked up in siloed systems, expensive on-premises solutions, and incompatible data formats. Open Data Lake Analytics helps users unlock that data.
Open Data Lake Analytics is an approach to traditional data analytics that opens up a wider range of data for analytics in the cloud. It’s built on open source Presto, the distributed SQL query engine, and supports open formats (Apache Parquet, Apache ORC, and more) and open interfaces that integrate with existing SQL systems. It can also run on any cloud.
Together, Ahana and Upsolver give organizations a cloud-native platform that analyzes large volumes of data at any speed – real-time or batch – automating the convoluted process of building cloud data lakes plus managing Presto. Working together enables data teams to manage complex pre-query data engineering and a distributed query engine without having deep technical knowledge.
Upsolver provides a visual, SQL-based interface to easily create and manage tables, as well as deep integration with AWS S3, AWS Glue Catalog, and PrestoDB for high performance queries.
We believe this partnership that provides open data lake analytics is critical to maximize the value of our customers’ data lakes. Together we accelerate time to value on data lake analytics projects and make it simple enough for any data practitioner to analyze any dataset.
This joint solution brings the power of technologies used at internet-scale companies like Facebook, Uber, and Twitter to data teams of any size. It helps companies expand beyond the cloud data warehouse to a multifaceted analytical stack, opening up a broad set of advanced use cases and greater flexibility.
We look forward to this new partnership with Upsolver and making the open cloud data lake accessible for everyone.
Additionally, Ahana and Upsolver are both part of the AWS Partner Network and premier members of the Linux Foundation’s Presto Foundation, a foundation that governs the community-driven PrestoDB project.
Partitioning is a great way to increase performance, but AWS Athena’s partition limits can lead to poor performance, query failures, and wasted time trying to diagnose query problems. Athena is an AWS serverless interactive service for querying AWS data lakes on Amazon S3 using regular SQL. It is a low-cost service; you only pay for the queries you run. Athena engine v2 is built on an older version of PrestoDB (v0.217), and developers use Athena for analytics on data lakes and across data sources in the cloud. A common limitation that can lead to poor performance is Athena’s partitioning.
Partitioning data
Partitioning splits your table into parts and keeps the related data together based on column values. Partitions are like virtual columns that help the system to scan less data per query. Queries that constrain on the partitioning column(s) will run substantially faster because the system can reduce the volume of data scanned by the query when using filters based on the partition.
Athena Hive partitioning
AWS Athena supports Apache Hive partitioning. You can partition your data by one or more columns. For example you can partition based on time, which can lead to a multi-level partitioning scheme. It means that you then have your data organized by key values that may look like this: …/month=01/day=01/year=2021/… This is perfect if your queries constrain on year and/or month and/or day.
Athena has the MSCK REPAIR TABLE command, which updates the partition metadata stored in the catalog. However, it is an inefficient command when there are a large number of partitions: the more partitions you have, the slower it runs. The command gathers metadata by scanning all the partitions in the specified table and updates the metadata in the Hive metastore accordingly so the system knows about any new partitions. It can also time out after 30 minutes, Athena’s default query time limit, if it has a lot of work to do. The command should run to completion so that all partitions are discovered and cataloged, and it should be run every time new partitions are added, e.g. after each ETL/data ingest cycle.
AWS Glue partitioning
AWS Glue is an Extract-Transform-Load (ETL) service that has a central metadata repository called the AWS Glue Data Catalog. You can use it to perform ETL operations and store metadata to enable data lake querying. It also helps you partition your data; you can create a Glue table over an Amazon S3 folder, which is a partitioned table that has key values in its path just like Hive. Automated Glue crawlers help you keep the Glue catalog in sync with Amazon S3, which is something you can’t do with Hive.
AWS Athena partition limits
If you are using Hive metastore as your catalog with Athena, the max number of partitions per table or the Athena partition limit is 20,000. You can request a quota increase from AWS.
If you are using AWS Glue (partition limits apply) with Athena, the Glue catalog limit is 1,000,000 partitions per table.
AWS Athena alternatives with no partitioning limitations
Open Source PrestoDB
Deploying PrestoDB on your own is one way to avoid Athena’s partitioning limitations. PrestoDB doesn’t have a hard partition limit, which helps boost your performance, and you are free to deploy the latest version of Presto and so benefit from all its features.
PrestoDB has the Hive connector’s system.sync_partition_metadata procedure to update partitions in the metastore; it works better than the MSCK REPAIR TABLE command that AWS Athena uses. Because AWS Athena is built on an older version of PrestoDB, it doesn’t have this procedure.
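As a sketch of what that looks like from the Presto CLI (the server address, schema name, and table name are placeholders, and the CLI binary may be named differently in your install):
presto-cli --server localhost:8080 --catalog hive \
  --execute "CALL system.sync_partition_metadata(schema_name => 'default', table_name => 'orders', mode => 'FULL')"
The mode can be ADD, DROP, or FULL depending on whether you want to add new partitions, drop removed ones, or do both.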
If you’re ready to take advantage of the full power of PrestoDB, you can use Ahana Cloud – the managed service for Presto in the cloud. Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes on Amazon S3 and many other data sources using PrestoDB.
Ahana runs as a SaaS offering and gives you the ability to deploy and manage Presto at any scale without having to manage PrestoDB or the underlying containerization, and without the restrictions of Athena’s partitioning or concurrency limits. You can easily and quickly deploy multiple Presto clusters, attach data sources, and scale clusters in and out on demand in seconds. You can use Hive or Glue catalogs with Ahana Cloud, and it has its own integrated, fully managed Hive metastore for convenience.
Check out the case study from ad tech company Adroitts on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.
Welcome to the 2nd blog in our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service.
Amazon Athena allows users to perform ad-hoc analytics on data lakes without the need for time-consuming Extract Transform Load (ETL) cycles. In order to maintain the availability of the service when processing such vast and diverse data, certain design decisions and limitations were introduced. Some of these are:
Inherent limits
Query string limit: 262,144 bytes.
Some statements like EXPLAIN, MERGE, and UPDATE are not supported.
Only one query per user is permitted, and 5 to 200 queries per account.
Queries can timeout on tables with thousands of partitions.
Cross-region querying is limited to sixteen regions.
AWS Athena partition limits happen because a Create Table As (CTAS) or INSERT INTO query can create up to a maximum of 100 partitions in a destination table. To work around this limitation you must use a series of INSERT INTO or CTAS statements that insert up to 100 partitions each.
Query timeouts can occur on tables with many thousands of partitions. This can happen when the table has many partitions that are not of type string.
Athena’s concurrent query limit exists because limits are imposed on accounts to keep users from overwhelming the regional service.
Configurable quotas
Queries can be queued or canceled by the Athena service when exceeding a data threshold.
An adjustable quota of 20 DDL (Data Definition Language, like CREATE TABLE) active queries (600-minute timeout) and 20-25 DML (Data Manipulation Language – like SELECT and CREATE TABLE AS) active queries (30-minute timeout) is permitted.
Maximum 20,000 partitions per table.
Maximum number of databases is 100. Each database can contain a maximum of 100 tables.
Other limitations
Full-table scans are necessary due to the lack of indices.
AWS Athena’s federated queries cannot be used with views, and the required Athena Data Source Connectors (which run on AWS Lambda) are complicated to set up.
Nodes can run out of memory when sorting columns.
There is no default support for stored procedures (though support can be added with an adapter).
Partition creation is restricted to 100 at once (workarounds exist).
The same server pools run queries for multiple accounts, so queries tend to get enqueued based on the service load.
..and much more. You can read more details about AWS Athena’s limitations in my other blog post on the topic.
Configuring and fine-tuning Presto or working around the limitations of Amazon Athena takes a lot of experience. A solution like Ahana Cloud can help.
Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes stored in Amazon S3, as well as many other data sources. Unlike Athena, it uses the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI; anyone at any knowledge level can use it with ease, there is zero configuration effort, and there are no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud. You can try Ahana Cloud today as a free trial.
Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.
Ahana Cloud for Presto 🌤 – Now “Open” for “Open Analytics”
I’m thrilled to share that today Ahana Cloud for Presto on AWS is generally available and open for Open Analytics. Ahana Cloud 🌤 is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists.
When we announced the Early Access for Ahana Cloud this past September, we had no idea of the phenomenal response we would get for our managed service. Before I get into the big improvements since our early release, a huge thank you 🙏🏼 to all the users that participated in the program and for all the valuable feedback particularly on making PrestoDB easier to manage and operate. And we are really sorry if we missed you during the early access. But today Ahana Cloud for Presto is available to everyone! So give it a try.
I’ve blogged and written quite a bit about open analytics this past year. After talking with hundreds of users who are building their next generation analytics platform on the data lake, I firmly believe that we are seeing an alternative architecture – an open, more flexible approach to augment the cloud data warehouse approach – where users may get locked-in to proprietary data technology and formats.
Companies like Facebook, Uber, Twitter, and Alibaba adopted this loosely coupled, disaggregated stack, which enables querying across many databases and data lakes, and it became the dominant standard for their analytics – with the tightly coupled data warehousing approach relegated to legacy workloads.
This new architecture is what I’m coining Open Analytics, and it’s built on Presto. And that’s why we built Ahana Cloud – with a mission of simplifying data lake analytics with Presto.
Ahana Cloud brings a SaaS open source analytics option to users with Presto at its core, using open formats and open interfaces. As the first fully integrated, cloud native managed service that simplifies the ability of cloud and data platform teams to provide self-service, Ahana Cloud brings the power of Presto to any data team of any size and skill level. And it is thrilling to see quite a bit of customer adoption just in our early access program.
What’s new in Ahana Cloud?
With this announcement I’m pleased to share some of the latest innovations on the platform.
AWS Marketplace Pay-as-you-Go (PAYGO): Ahana Cloud for Presto is elastically priced based on usage, with PAYGO and annual options via AWS Marketplace, starting from $0.25 per Ahana cloud credit. Try out our pricing calculator.
Better logging and monitoring integration
Query the Presto Query Log with Presto – We decided the best way to experience the power of Presto is for us and our users to try out Presto on data generated by Presto. The Presto query log is automatically integrated into your account and stored in an S3 data lake. We now have an option to integrate this query log and attach it as an external table to an Ahana-managed Hive Metastore.
Presto coordinator logs and Apache Superset logs are integrated into Amazon CloudWatch for easy access and integration into your monitoring system.
Improved security
We have added mandatory password protection on all Presto clusters created and enabled HTTPS-only access; no HTTP access to the cluster is allowed. This is critical given how easy it is to add data sources to Ahana Cloud with a few clicks.
Data catalog optionality
The Ahana-managed Hive metastore is now optional. You can also bring your own Amazon Glue Catalog or user-managed Hive Metastore. This means lower costs so that you only pay for what you actually need and use. When a cluster is stopped and moves to an inactive state, if there is no attached Ahana-managed Hive Metastore – costs go down to ZERO. On cluster start, all cluster config including data catalogs attached are preserved giving users a one-click start.
In addition, users continue to benefit from:
Easy-to-use Ahana SaaS Console for creation, deployment and management of multiple Presto clusters within a user’s AWS account bringing the compute to user’s data
Support for Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL, and Amazon Elasticsearch
Click-button integration for user-managed Hive Metastores and Amazon Glue
Apache Superset Sandbox for admins to ensure connectivity and access
Cloud-native, highly scalable and available containerized environment deployed on Amazon EKS
Additionally, we also announced a go-to-market solution in partnership with Intel which includes an Open Data Lake Analytics Accelerator Package for Ahana Cloud users. It leverages Intel Optane on the cloud with AWS.
Ahana Cloud is available in AWS. You can sign up and start using our service today for free.
What’s next
Today, in under 30 minutes you can be up and running with Presto and Ahana Cloud. The combination of a cloud managed service with federated SQL analytics has so far proven to be a game-changer for our customers.
We look forward to continuing to innovate Ahana Cloud for our customers and to deliver a seamless, easy experience for data teams looking to leverage the power of Presto.
PrestoDB is growing in popularity among data teams. The need for data engineers, data architects, data scientists, and analysts to run interactive, ad hoc analytics on large amounts of data continues to grow explosively. Data platform teams are increasingly using the federated SQL query engine PrestoDB to run such analytics for a variety of use cases across a wide range of data lakes and databases in-place, without the need to move data. PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter. Presto analytics are becoming very popular.
Let’s look at some important characteristics of Presto that account for its growing adoption.
Easier integration with ecosystem
PrestoDB was designed to seamlessly integrate with an existing data ecosystem without any modification needed to the on-going system. It’s like turbocharging your existing stack with an additional faster data access interface.
Presto provides an additional compute layer for faster analytics. It doesn’t store the data, which gives it the massive advantage of being able to scale query resources up and down based on demand.
This separation of compute and storage makes the Presto query engine extremely suitable for cloud environments. Most cloud deployments leverage object storage, which is already disaggregated from the compute layer, and auto-scale compute to optimize resource costs.
Unified SQL interface
SQL is by far the oldest and the most widely-used language for data analysis. Analysts, data engineers and data scientists use SQL for exploring data, building dashboards, and testing hypotheses with notebooks like Jupyter and Zeppelin, or with BI tools like Tableau, PowerBI, and Looker, etc.
PrestoDB is a federated query engine that has the ability to query data not just from distributed file systems, but also from other sources such as NoSQL stores like Cassandra, Elasticsearch, and RDBMS and even message queues like Kafka.
Performance
The Facebook team developed PrestoDB because Apache Hive was not suitable for interactive queries. Hive’s underlying architecture, which executes queries as a series of MapReduce and Tez jobs, works very well for large, complex jobs but does not suffice for low-latency queries. The Hive project has more recently introduced in-memory caching with Hive LLAP; it works well for certain kinds of queries, but it also makes Hive more resource-intensive.
Similarly, Apache Spark works very well for large, complex jobs using in-memory computation. However, it is not as efficient as Presto for interactive BI queries.
Presto is built for high performance, with several key features and optimizations such as code generation, in-memory processing, and pipelined execution. Presto queries share a long-lived Java Virtual Machine (JVM) process on worker nodes, which avoids the overhead of spawning new JVM containers.
Query Federation
Presto provides a single unified SQL dialect that abstracts all supported data sources. This is a powerful feature which eliminates the need for users to understand connections and SQL dialects of underlying systems.
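As a sketch of what that looks like in practice, the query below joins a table in a Hive-backed data lake catalog with a table in a MySQL catalog in one SQL statement; the server address and the catalog, schema, and table names are all hypothetical:
presto-cli --server localhost:8080 --execute "
  SELECT o.order_id, o.total, c.customer_name
  FROM hive.weblogs.orders o
  JOIN mysql.crm.customers c ON o.customer_id = c.customer_id
  LIMIT 10"
Because each catalog maps to a configured connector, the same query text works regardless of where each table physically lives.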
Design suitable for cloud
Presto’s fundamental design of running storage and compute separately makes it extremely convenient to operate in cloud environments. Since the Presto cluster doesn’t store any data, it can be auto-scaled depending on the load without causing any data loss.
As you can see Presto offers numerous advantages for interactive ad hoc queries. No wonder data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place, without the need to move data.
Ahana is the startup that provides the first cloud-native managed service for Presto, the fast-growing, open source distributed SQL engine. Backed by GV (formerly known as Google Ventures) and Lux Ventures, the Ahana team includes experts in Presto, AWS, and big data. This blog post discusses how AWS users have evolved their big data requirements and how the team architected our managed service offering, highlighting the best practice of providing an “In-VPC” deployment. We hope other infrastructure software startups can benefit from sharing some of the key learnings that led to the launch of Ahana Cloud for Presto on AWS.
For some more background, Presto is an open source system for federated data analytics. Federation means the system can map multiple data stores. It enables users to access data where it lives in a wide variety of sources via federated plug-in connectors without moving or copying the data. Presto was originally developed by Facebook. Today, it’s deployed in large-scale production at some of the world’s most data-driven companies, including Uber and Twitter. Presto addresses the business need of leveraging all data within an organization to generate insights and drive decision-making faster than ever before. Presto also leads in delivering on the technology trends of today: disaggregation of storage and compute, resulting in the rise of Amazon S3-based data lakes and on-demand cloud computing. You can learn more on the AWS Presto page.
While the SQL engine is the main component of an interactive ad hoc analytics system, the other components, such as the metadata catalog, the data sources, and the visualization tools or notebooks, require integration. Deploying and managing complex software in AWS can be challenging. Presto administrators must set up, configure, and connect one data store for Presto’s metadata. Typically, this is Apache Hive or AWS Glue. They must also create and configure their connectors to access their data sources and then configure catalog entries for each data source. Presto requires the admins to deal with many properties files to achieve this, which is both laborious and error-prone.
Ahana Cloud for Presto addresses these complexities and more with an easy-to-use cloud-native managed service. In 60 minutes or less, Ahana allows users to build an end-to-end deployment: multiple clusters of Presto, their Glue or Hive metadata catalogs, their AWS data sources, and user-facing tooling. Customers get the power of Presto with the capabilities of AWS for faster, more iterative, and interactive data discovery—without the complexity. Analysts, data engineers, and data scientists enjoy the freedom to rapidly use any data in the organization and do so in a more self-service way. Additionally, AWS customers can procure services the way they’re used to—quickly and easily—on an hourly pay-as-you-go (PAYGO) listing on AWS Marketplace, simply billed to their AWS accounts.
The “In-VPC” Deployment Approach that Data-Driven Customers Want
As cloud service adoption has grown, the way companies store and analyze their data has evolved. Early adopters were focused around innovation: building and deploying applications quickly with AWS and other public cloud providers. Most of the mission-critical data was still produced and analyzed in data centers, mainly due to control-related concerns of the data, such as where that data could be copied, how it could be used, and who could access it. Now, as cloud adoption has become mainstream, we see companies with the majority of their data both created and stored in the cloud, especially in cost-efficient Amazon S3-based data lakes. Along with this shift, so have the concerns related to how and where data is sent, its use, and access controls. Users do not want to lose control of their data; they prefer to not have to ingest it to other environments. They want data to remain in their own Virtual Private Cloud (VPC).
A new cloud-native architecture model has emerged for data-focused managed services like Ahana. We call it the “In-VPC” deployment model, separating the control plane from the compute and data planes.
The Role of the Control Plane
Ahana Cloud has two major “planes,” the control plane which is delivered as a SaaS, and the compute plane where Presto clusters run, which is delivered as a managed service. The Ahana Control Plane, just as it sounds, oversees, orchestrates, and manages the rest of the environment. The control plane runs in its own VPC, in the Ahana account separate from the customer account VPC, where the compute plane and data live. This makes management much easier without the customers having to share control of user data with Ahana. This is important as users want their data to remain in their own VPC and not be ingested in any other environment (e.g., some 1st gen cloud data warehouse services). In fact, the Ahana control plane running in the Ahana VPC never sees any of the customer’s data; it is totally separate from the customer’s “In-VPC” compute plane deployment.
Integrated Metastore
For further ease of use, Ahana pre-integrates an Apache Hive metastore/catalog, which is automatically created, so it’s not essential to set up other components like AWS Glue. But if users have an existing metastore, including Glue, they can use that if they prefer.
Connectors Included
In terms of connectors, Ahana initially ships with support for AWS data services like Amazon S3 and Amazon RDS for MySQL and PostgreSQL, and others. More connectors for sources like MongoDB and Amazon Redshift will follow soon. Ahana automates the creation of connections and catalogs, removes the need to juggle configuration files, and eliminates the need for Presto restarts. Catalogs can be created once and used by multiple clusters.
In the diagram, there are two core components, both created and managed by Ahana Cloud:
1. The Ahana control plane (top) and its UI orchestrates the Presto environment. There’s consolidated application logging, query logging, and monitoring, which means users have full and easy management and control. There are security and access controls and pay-as-you-go hourly billing and support.
The control plane runs in the Ahana Amazon account, external to the user’s environment.
Ahana and its employees have no access to the user’s data.
It is multi-tenant to scale with customer accounts.
The control plane supports SSO with Amazon Cognito, LDAP authentication, and SQL-based authorization for Presto (RBAC). In the future, there will be Apache Ranger support.
2. The Ahana compute plane (bottom) runs in each user’s VPC, deployed as a single-tenant environment within the user’s account. The control plane first creates a dedicated VPC for the compute plane. It then deploys Amazon EKS for a highly elastic, highly available environment to create Presto clusters. Once the control plane completes the initial set-up of the compute plane, users can create and manage any number of Presto clusters, which then get provisioned into the compute plane in Amazon EKS.
The compute plane, and the user data it interacts with, runs in the user’s account.
Each cluster is created in an individual node group to utilize the most advanced autoscaling and high-availability capabilities EKS provides.
Each Presto cluster comes pre-integrated with a Hive Metastore to store metadata for schemas and tables generated via Presto and an Amazon S3 data lake where data inserted into tables gets stored.
In addition to the pre-integrated catalog and Amazon S3 bucket, users can attach external Hive Metastores or AWS Glue catalogs pre-populated with metadata for structured data stored in Amazon S3 and databases running on Amazon RDS for MySQL or PostgreSQL.
This separation of the control, compute, and data planes is enabled by Amazon’s recommended approach to cross-account access via an external ID, a mechanism built on secure token exchange through AWS STS. Users simply update their policy to include the Ahana ARNs (Amazon Resource Names). The In-VPC deployment approach gives users greater security and cleaner management. For further details, we recommend this AWS blog on architecting successful SaaS services.
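For readers unfamiliar with the external-ID pattern, here is a hedged sketch of its general shape using boto3. The role ARN, external ID, and session name are placeholders and do not reflect Ahana’s actual configuration.

```python
# General shape of AWS cross-account access with an external ID.
# The role ARN and external ID below are placeholders, not Ahana's real values.
import boto3

sts = boto3.client("sts")

# The customer's IAM role trusts the service's account, but only when the agreed
# external ID is presented in the AssumeRole call (AWS's confused-deputy safeguard).
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/ExampleComputePlaneRole",
    RoleSessionName="compute-plane-provisioning",
    ExternalId="example-external-id",
)["Credentials"]

# The temporary credentials scoped to that role can then be used to provision
# resources inside the customer's account, e.g. to look at EKS clusters.
eks = boto3.client(
    "eks",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(eks.list_clusters()["clusters"])
```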
Summary
Ahana Cloud for Presto is the fully managed, end-to-end environment for Presto. It gives users an interactive multi-cluster UI with single-click cluster and data source management. It provides automatic set-up, security features, and resilience features. It leverages the “In-VPC” deployment, which separates the control, compute, and data planes for customers. Finally, Ahana is procured using a simple and affordable pay-as-you-go usage-based licensing model on AWS.
Ahana Cloud for Presto: Analytics for Disaggregated Stack
Today I am incredibly excited to announce Ahana Cloud for Presto, the first fully integrated, cloud-native managed service for Presto, which makes it simple for cloud and data platform teams of all sizes to provide self-service SQL analytics for their data analysts and scientists. Before I share more about what Ahana Cloud for Presto does, I’d like to share why we built it and the problems it solves.
Data warehousing emerges
Data warehousing emerged in the 90s as the internet was surging in popularity and data was burgeoning. The new, competitive, constantly changing global economy required greater business intelligence, and there was a broad realization that data needed to be integrated to provide the critical business insights for decision-making. Teradata, Oracle, Microsoft SQL Server and IBM DB2 warehouses exploded. That requirement for greater business insight has only grown since, and Snowflake’s tremendous IPO last week showed how data warehousing has evolved and now moved to the cloud.
The first step of data warehousing involves continuously ingesting all data into a single database.
Once data is in the data warehouse, you can query it and report on it. Typically, these systems are closed source, with data stored in proprietary formats. Because of the technology and data lock-in, these systems are also very expensive.
An alternative architecture arises – the open federated, disaggregated stack
Over the past 5 years, while the traditional data warehousing approach of a tightly coupled database continued to be deployed, an alternative approach gained wide adoption among the most innovative technology companies – Facebook, Twitter, Uber, Netflix and others. A loosely coupled, disaggregated stack that enabled querying across many databases and data lakes became the dominant standard for their analytics, with the tightly coupled data warehousing approach relegated to legacy workloads.
This new SQL analytics stack is made up of four elements: the query engine, the metadata catalog, the transaction manager, and the storage engine. And Presto has emerged as the de facto query engine. Presto is a federated, distributed query engine created and open sourced by Facebook. It is designed to be extensible and pluggable, which has led to its extensive connector ecosystem.
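To show what federation means in practice, here is a hedged sketch of a single Presto query that joins raw events in an S3-backed hive catalog with a table in a MySQL catalog; the catalog, schema, table, and column names are assumptions for illustration only.

```python
# Hedged sketch of a federated query: one SQL statement joining an S3-backed Hive table
# with a MySQL table across two Presto catalogs. All names below are illustrative.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-cluster.example.com", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()

cur.execute("""
    SELECT c.customer_name,
           count(*) AS page_views
    FROM hive.weblogs.page_views AS v   -- raw events in the S3 data lake
    JOIN mysql.crm.customers     AS c   -- dimension data in an operational database
      ON v.customer_id = c.customer_id
    GROUP BY c.customer_name
    ORDER BY page_views DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```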
But why? Why is this disaggregated stack, with Presto as the foundation, the preferred choice for the most advanced technology companies, who could easily afford to buy products off the shelf?
First, this federated, disaggregated stack addresses the new realities of data
There is just too much data being generated, and a single database is no longer enough to support a wide range of analytics
Data will be stored in data lakes, but other pertinent data will still reside in a range of other databases
SQL analytics is needed both for the data lake, where raw data resides in cheap storage, and for the broad range of other databases where data continues to live
Second, this federated, disaggregated stack is open
Open source – PrestoDB under the Linux Foundation is completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community.
Open formats – PrestoDB doesn’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet and others.
Open interfaces – PrestoDB is ANSI SQL compatible. Standard JDBC / ODBC drivers can be used to connect it to any reporting, dashboarding, or notebook tool. And because it is open source, SQL language support continues to be added and expanded.
Open cloud – PrestoDB is cloud agnostic, and because it runs as a query engine without storage, it aligns naturally with containers and can run on any cloud.
Technology companies prefer this open approach compared to the proprietary formats and technology lock-in that come with the traditional data warehousing approach.
Why isn’t the open federated, disaggregated stack with Presto ubiquitous?
This is the question I asked myself over the past year. As I talked with hundreds of data engineers and platform engineers, the answer became crystal clear.
The power of Presto is fantastic, but it is still out of reach of many platform engineering teams, who may not have the time or skills required to manage it. Born in the Hadoop world, Presto is still complex. It is a distributed data system with extensive configuration, tuning, integration and management required. Managing it on top of containers and systems like Kubernetes makes it even more challenging. While some companies – particularly large internet ones – enable self-service SQL analytics across many data sources, including both data lakes and databases, many others have not yet been able to do so given the complexity of these activities. This is what Ahana Cloud solves.
Introducing Ahana Cloud
Ahana Cloud for Presto, the first fully integrated, cloud-native managed service, simplifies the ability of cloud and data platform teams to provide self-service SQL analytics for an organization’s analysts and scientists.
Easy
Go from 0 to Presto in 30 minutes, including the in-VPC service in your own AWS account, with the Ahana Console
Fully integrated
Ahana comes with a built-in catalog and easy integration with data sources, catalogs and dashboarding tools
Cloud Native
Runs on Amazon Elastic Kubernetes Service for high scalability, availability and manageability
This is just the beginning for us at Ahana. The combination of a cloud managed service with federated SQL analytics has opened up an enormous set of possible innovations to simplify analytics for platform teams – with the eventual goal of being self-managed and self-healing. I am excited that Ahana makes the power of Presto widely accessible and achievable to data platform teams of every size and, at the same time, contributes back to the community and open source.
Are you ready to go from 0 to Presto in 30 minutes?
Bridging the gap to bring two Presto communities together: Welcoming Starburst Data to the Presto Foundation
Steven Mih, Co-Founder & CEO
This week at Ahana we announced our company launch and vision to further grow and evangelize the PrestoDB community alongside the Linux Foundation and Presto Foundation with founding members Facebook, Uber, Twitter, and Alibaba. Also this week, Starburst Data shared a blog announcing that they joined the Presto Foundation.
As news of Ahana and PrestoDB circulated in outlets like ZDNet, Datanami, and many more, we kept hearing the same question come up, one that community members like Thomas Spicer at OpenBridge asked in his recent blog: Why are there two Presto projects and how many do we need?
To provide context, today there are two separate Github repos, two Slack channels, two websites for Presto, and two foundations. First, there’s the original PrestoDB with Linux Foundation’s Presto Foundation. Second, there’s the similarly-named fork PrestoSQL with Presto Software Foundation, which was started and controlled by the new co-founders of Starburst Data. Whoa.
You may be thinking, “wow, what a hot mess!” And you wouldn’t be alone. I’ve talked with many developers who feel similar and just want to code without all the confusion! Gale Hashimoto & Chiara Portner at Hopkins & Carley recently blogged about open source project naming conventions and how they often advise developers on them.
Fortunately, the situation is looking like it will soon be resolved.
The Linux Foundation is one of the most experienced organizations at bringing developer communities together. Arguably, the Linux Foundation is the spiritual center of open source, along with the Apache Software Foundation. And it has achieved unity numerous times, some examples being Linux itself and container image formats via the Linux Foundation’s Open Container Initiative and CNCF.
While some of this may feel like “yawn-inducing inside-baseball”, in my view it matters a whole lot. Just look at what the Linux and Kubernetes projects have achieved for the greater good of developers worldwide.
Transparency is one of the key tenets of the Linux Foundation’s Presto Foundation (see its three main principles in the image below). Underlying the principle of a united community is the idea that software development needs to thrive, and that isn’t the case when efforts are duplicated across multiple project code bases.
Since late December of last year, I’ve been aware of many meetings and continuing efforts between the Linux Foundation’s Presto Foundation and the Presto Software Foundation to align with the above principles. Over the last few days, however, a breakthrough in bridging these two communities has occurred. While we won’t know the reasons for some time, I suspect that Ahana may have been the catalyst.
I look forward to welcoming Starburst to the Presto Foundation, and for the benefit of the community, I hope we can see the confusion end with 1 foundation, 1 primary code base for new development, and 1 community.
Getting Perspective on Open Source Projects and Forks with Ted Dunning, Apache Software Expert
We spent some time with Apache Software Foundation board member, PMC member, and open source expert Ted Dunning to get his thoughts on the state of open source software.
Can you start by telling us a bit about yourself and your open source background?
I’ve been involved with machine learning and big data systems for a very long time and have spent even longer working in the open source world. The first open source I was involved with was in the mid-’70s, which at the time was guys trading floppy disks in a coffee shop parking lot in Colorado. More recently, I’ve been with the Apache Software Foundation (ASF) since 2007.
What’s your general philosophy on open source and the community behind an OSS project?
In my view, it’s not just one part or a few individuals that create an open source community – it’s the community as a whole. Sure, you might have a few more vocal folks, but an open source community is only as good as its least vocal member. And the community is a great thing because it gives people the ability to be part of a community – a wonderfully electrifying experience. It’s a unifying experience, and it is the ultimate marketing.
There are cases where people pretend to be open but won’t really let anyone else play. If a community is built cynically, it can turn into vast bitterness. And it’s very common for companies to say “we employ the creators of this project”, but in fact I think that it is the community in aggregate who wrote it. Yes, a few people may have written the first lines of code and may have gotten the ball rolling, but we have to remember that the project is not just the source code. There are always a lot of different contributions to the creation of a project.
And you’ve been a part of several open source companies, MapR being the latest, where you were CTO. How did you strike the balance between open source and proprietary?
MapR was actually very, very interesting because it had open and closed parts. But they were good about finding a balance – it wasn’t just all proprietary software or all open source. I tried to be very publicly clear about our decisions about open sourcing some features and keeping others proprietary. One of the challenges of a vendor-backed open source technology is that you’re going to have a culture gap inside the company. You have people come from closed source, and you have people that come from open source, and neither side necessarily understands the other. But despite that, I felt much better being straightforward and honest about our decisions rather than pretending that everything was open.
When it comes to building open source communities around projects, I’m sure you’ve seen many times that things can go wrong. In some cases that might mean a project splits off, or forks, from the original project. In your experience, what can that mean for a community?
It’s so sad when communities split up. When these destructive forks happen, quite frankly it’s a huge tragedy for open source. The common characteristics of these cases are that a) everyone is just confused and b) there typically are major differences in worldviews of the same events on either side of the fork, making it very hard to get people together when there’s no shared understanding of what was happening. I see these two specific things happen when it comes to bad forks:
Both projects keep the same name or too similar of a name, so no one knows the differences between the two
Splitting the project splits the community, and this may render both irrelevant
If you split a community, a mailing list, a Slack channel, there’s a reverse-synergy effect. A lot of good people who want to contribute to something but don’t really like conflict end up just leaving. People have plenty of distractions in their lives. And they have plenty of other opportunities to be creative, fun, and happy. Then like 10 years later they go, “Oh, I remember that, wonder what happened to it?”
What’s a specific example of a fork that went wrong?
Sun’s Open Office/LibreOffice project split was a classic fork that split the community. As Sun’s contributions to the project declined, there were concerns from the Open Office community over whether it would remain under a neutral foundation. When Oracle acquired Sun, discussions of a fork began, and as a result LibreOffice was created and the community was split between the two projects. Unfortunately, it created a lot of conflict and negativity which harmed both products and the Open Office/LibreOffice community at large. Much of the community stopped contributing to or maintaining either.
The Hudson/Jenkins fork had a similar potential for problems, but essentially the entire community left the original Hudson branch and moved over to the Jenkins fork. Today, it is hard to see that there ever was a fork. Things that went right included distinctive naming (Hudson versus Jenkins) and a strong consensus about which fork to stay with moving forward (Jenkins). The original Hudson project moved to the Eclipse Foundation but has been dormant since 2017, leaving room for Jenkins to flourish.
What about projects that keep too similar of a name? We’re seeing that play out today with the Presto open source project with the original PrestoDB and the similarly-named fork PrestoSQL. What are your thoughts on this?
My initial reaction with Presto is that I always have to go look up which one is which. There’s PrestoSQL, PrestoDB, Presto Foundation, Presto Software Foundation and I can’t keep them straight. And so there’s massive confusion unless you’re one of the participants – as an outsider, I have no idea which one is which. From what I understand about the project, my guess is that the Presto creators who worked at Facebook didn’t make it apparent that they were going off to start a new community. They may not have been as open as they could have been about this, as sometimes happens when people leave a company. Later Facebook decided to put more effort behind the community, and it caused confusion because there’s now effectively two communities with nearly identical names.
But I think that the moral of the story is that when you name something, names matter. Anybody who forks should have a completely different name. That alone makes it easier for everyone to understand. Of course, it isn’t always clear which is the fork and which is the original, but at least you’d know if it was Presto in one place and Abracadabra in the other. Everybody would get it. It doesn’t even matter all that much which side gets which name.
Introducing Ahana
Dipti Borkar, Co-Founder & CPO and Steven Mih, Co-Founder & CEO
Meet Ahana
There are many factors that contribute to strong co-founder relationships. Forbes refers to one set and defines them as the 3Ts: ‘trust’, which is self-explanatory; ‘talk’, the ability to talk things out even if you disagree; and ‘target’, having one mission with aligned goals. While it’s easy to put this into a framework, getting it right is rare. For us, it took nearly 10 years, joint experiences at two companies, and working through numerous tough situations and tough conversations. The outcome is Ahana.
While we’re heads down working on Ahana, we’d like to share our vision and the problem we intend to solve. Our vision is to simplify the interactive, ad hoc analytics journey for users. But hasn’t this problem been solved already? Let’s walk through the evolution of data systems for a minute or two.
An architectural shift
It has been nearly 15 years since the Google MapReduce paper was published (2004) and Amazon Web Services was launched (2006). These were key milestones that marked the beginnings of two of the biggest trends in enterprise software, trends that continue even today. In those 15 years a lot has changed. Data is stored in many more places than the three early relational databases (IBM DB2, Oracle, Microsoft SQL Server). In fact, with polyglot persistence, organizations run different data systems for each use case. This also means that the metadata that used to be unified in the star schema of the sacred data warehouse is now spread across a variety of data sources.
In addition, the key components of a database as defined in database textbooks are now independently running pieces of software in the big data stack. The database stack is completely disaggregated: the query processor (examples: Apache Hive, Apache Spark, Presto), the metadata catalog (examples: Hive Metastore, AWS Glue), the storage engine (examples: object stores such as AWS S3, Google GCS, and Azure ADLS, as well as other RDBMSes) and even the transaction manager (examples: Apache HUDI, Delta Lake). This separation brings great flexibility at the cost of tremendous complexity.
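As a small illustration of this disaggregation, the sketch below reads table metadata straight out of an AWS Glue catalog with boto3, independently of whichever query engine (Hive, Spark, or Presto) later scans the underlying files in S3. The database name and region are assumptions for the example.

```python
# Illustration of the disaggregated stack: the metadata catalog is its own service that
# any query engine can share. The database name and region below are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List the tables registered in one Glue database. Presto, Hive, and Spark can all
# resolve these same table definitions and then read the files directly from S3.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="weblogs"):
    for table in page["TableList"]:
        location = table["StorageDescriptor"]["Location"]  # typically an s3:// path
        print(f'{table["Name"]}: {location}')
```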
The largest internet companies have some variation of these components in production at massive scale. Big platform teams with the brightest engineers work on integrating these components together, innovate on them, deploy them and maintain them. And while it is complex and resource intensive, they get tremendous value by enabling data-driven decisioning with ad hoc analytics platforms.
But every company and organization should have the ability to make interactive, ad hoc data-driven decisions without the need to integrate, manage and deploy a complex stack.
Data Federation makes a comeback
The heart of the data stack is the query processor, the query engine. And the good news is that there has been immense innovation in the open source community on modern standalone query engines. In fact, the idea of a separated query engine isn’t new. Data Federation has had many iterations over time.
Data Federation 3.0 made a comeback with Google’s Dremel paper in 2010. Couchbase, where we both worked for many years, implemented SQL++, another federated query engine designed at the UC San Diego database lab. Presto was designed and built at Facebook and open sourced in 2013. Since then, its adoption has simply exploded.
Facebook and the creators of Presto – Martin, David and Dain, who were engineers there – built a nicely designed distributed system (see the Presto IEEE paper): highly modular, pluggable and extensible. While there is room for improvement (such as moving away from a single coordinator node, better resource management, and a more advanced planner that reads less data and pushes down more work), Presto has become the de facto choice for interactive, ad hoc querying on a disaggregated stack.
Community-driven Presto and Presto Foundation
We have both learned from our enterprise software experience that developers and engineers primarily care about solving problems in order to get things done. As the use of an open source project grows in importance to an organization, its developers also come to care about the project’s transparency, openness and neutrality. Open source software has moved from its fringe beginnings to the forefront of technology innovation. While both of us have participated in the evolution of open source commercialization, from support-only models to open-core proprietary subscription models and even infrastructure software as a service, the common underlying factor across all these models is the requirement of a strong, vibrant open source community. Meeting that requirement takes more than an open source license like Apache 2.0. It takes a thoughtful, transparent, and authentic approach in all interactions. We believe that is open source done right.
At Alluxio, where we both worked recently, we became very involved with the open source Presto community. Dipti presented many joint talks with founders of Starburst Data as well as pushed out product offerings integrated with Starburst Presto. Steven engaged with many end user companies to evaluate and deploy Presto with Alluxio into production environments.
We soon realized that there was a lot more to the Presto community than we knew. There were in fact two separate GitHub repos, two Slack channels and two websites, all of which was very confusing for any open source community. Then, in September 2019, Facebook donated the original project, PrestoDB, to the Linux Foundation to further grow and evangelize the community under an established open source governance model similar to CNCF and Kubernetes. We joined the Presto Foundation (Steven as a member of the Governing Board and Dipti as chairperson of the Outreach Committee) to evangelize and support Presto.
A new adventure begins
We had talked about founding a company a few times in the past, and seeing the ever-growing problems with disparate data systems, combined with the federated query engine returning to the forefront, we embarked on a new venture. We believe that data federation 3.0 with Presto will become the architectural foundation to meet the needs of modern data teams.
So here we are. We are excited to share that Ahana has raised $2.25 million in funding led by GV (formerly Google Ventures), with participation from Leslie Ventures and other angel investors. We’re thrilled to have Dave Munichiello from GV as our lead investor. He and the whole GV team have been fantastic in their support of our vision. We are excited to build out our technical team and deliver simplified Presto-based analytics products for every organization. Stay tuned by joining our All Things PrestoDB newsletter.
We’ve worked together at two companies, we have a great working relationship, and we’re passionate about bringing open source products to market. This time, we’re thrilled to start from the ground up, as friends and as co-founders of Ahana. Cheers!
San Francisco, April 2020 – Dipti Borkar and Steven Mih celebrating (at a social distance) the closing of the Ahana seed round in Steven’s garage office with the leaf blower