How does the Price-Performance ratio of AWS Athena Presto compare to Ahana Cloud for Presto?

First, Overall Differences

Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project which was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. There are a handful of popular services that use Presto, including both AWS Athena and Ahana Cloud. 

The biggest difference between the two is that AWS Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers. The next biggest difference is the pricing model. With AWS Athena you don't pay for the amount of compute used; instead, you pay for the amount of data scanned. Ahana Cloud, on the other hand, is priced by the amount of compute used. This can make a huge difference in price/performance. Before we get into price-performance specifically, here’s an overview of the comparison:

| | AWS Athena (serverless Presto) | Ahana Cloud for Presto (managed service) |
| --- | --- | --- |
| Cost | Pay only for the amount of data scanned: USD $5 per terabyte scanned | Pay only for the infrastructure resources used |
| Cost effectiveness | Only pay while the query is scanning, not for idle times | Only pay for EC2 and Ahana Cloud while compute resources are running, plus ~$4 per day for the managed service |
| Scale | Can scale query workloads but has concurrency limits | Can easily scale query workloads without concurrency limits |
| Operational overhead | Lowest operational overhead: no need to patch OS – AWS handles that | Low operational overhead: no need to patch OS – Ahana Cloud handles that and the operation of servers |
| Update frequency | Infrequent updates to the platform; not current with PrestoDB, over 60 releases behind | Frequent updates to the platform; typically two releases behind the current version |

Both let you focus on deriving insight from your analytical queries, as you can leave the heavy lifting of managing the infrastructure to AWS and the Ahana Cloud managed service. 

How do you define price-performance ratio?

As Wikipedia defines it: “In engineering, the price–performance ratio refers to a product’s ability to deliver performance, of any sort, for its price. Generally, products with a lower price/performance ratio are more desirable, excluding other factors.”

For a quick example of the price-performance ratio, here are two cars with similar performance but a very different ratio: 

McLaren P1 vs. Tesla Model S P90D

And if you wanted to compare the 0 to 60 mph performance vs. their price tag:

McLaren P1: $1,150,000, 0 to 60 mph in 2.6 seconds
Tesla Model S P90D: $131,000, 0 to 60 mph in 2.6 seconds

While both can get you somewhere, based on the criterion of 0 to 60 mph the price-performance ratio is much lower (and therefore better) with the Tesla: you pay roughly one-ninth of the McLaren's price for the same acceleration.

Comparing the Price-Performance Ratio of Amazon Athena vs. Ahana Cloud

For this comparison, we’ll look at performance in terms of the amount of wall-clock time it takes for a set of concurrent queries to finish. The price is the total cost of running those queries. 

Instead of using a synthetic benchmark, we’ll look at the public case study on the real-world workloads from Carbon, who used Athena and then switched to Ahana Cloud. While your workloads will be different, you’ll see why the price-performance ratio is likely many times better with Ahana Cloud. And by going through an example, you’ll also be able to apply the same method when doing a quick trial (we’re here to help, too).

Here are a few things that the Carbon public case study showed:

  • While you cannot tell how many EC2 instances Athena V2 uses, or of what type, Carbon determined that they could get similar performance with 10 c5.xlarge workers on Ahana Cloud
  • Athena V2 would start to queue queries once 2 other queries were already running, meaning that total wall-clock time was extended as a result.

AWS Athena is constrained by AWS concurrency limits

Ahana has higher concurrency so queries finish faster

  • The queries would be charged at a rate of $5/TB scanned regardless of the amount of compute used. Their 7 tests ended up scanning X TBs = $Y 
  • Ahana Cloud with 10 X c5.xlarge workers has total costs of:
| Type | Instance | Price/hr | Qty. | Cost/hr |
| --- | --- | --- | --- | --- |
| Presto Worker | c5.xlarge | 17 cents | 10 | $1.70 |
| Presto Coordinator | c5.xlarge | 17 cents | 1 | $0.17 |
| Ahana Cloud | | 10 cents | 11 | $1.10 |
| Ahana Cloud Managed Service | | 8 cents | 1 | $0.08 |
| Total | | | | $3.45 |

So, for $3.45 you can run any number of queries scanning any amount of data for one hour, compared to $5.00 for a single Athena query that scans one TB of data.

Summary

While there is value in the simplicity of AWS Athena’s serverless approach, there are trade-offs around price-performance. Ahana Cloud can help.

Ahana is an easy, cloud-native managed service with pay-as-you-go pricing for your PrestoDB deployment. 

Get started with a 14-day free trial today.

Zeppelin and Presto

What is Zeppelin?

Zeppelin is an open-source tool for producing data science notebooks that are then stored as .json files. 

This kind of analytics software is called a notebook. A Zeppelin notebook (called a note) is made up of paragraphs, each with two sections: one where you type in your code, free text, or markdown-formatted annotations, and one where the results appear as output once you run the code.

The outputs can be charts, query results, or a markdown preview. In Zeppelin, you can organize these paragraphs — code and results sections — into dashboards by stacking them or placing them next to each other. 

With Zeppelin, you not only create data pipelines, but you can also visualize them as graphs to explain your ETL setup to others. If you want to make Zeppelin usable for your organization (the marketing or sales teams, for example), you can build dynamic forms such as query templates. Users then only fill in the form and don't need to worry about the syntax and correct structure of their queries.

Zeppelin works with a wide range of programming languages that are used in data analytics and data engineering. Among them are SQL, SparkSQL, Python, Scala, Hive, shell, and markdown. This makes Zeppelin a popular tool for data scientists and engineers working with Apache data processing tools and frameworks. 

Zeppelin offers multi-user management for collaborative work on big projects. Companies generate large amounts of structured and unstructured data that often has to be stored in different databases and data lakes. With Zeppelin, you can pull data from them all into one notebook using different interpreters; there is no need to create multiple notebooks.

For querying across Zeppelin and other data sources, many have turned to Presto, an open-source distributed query engine.

What Is Presto?

Presto is an open source distributed SQL query engine that you can connect to many different data sources like relational databases, cloud storage, NoSQL databases, and many more. Many use Presto for their data lake analytics needs, as well as for federated querying across a range of data sources.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by adding worker nodes. This has made it an efficient choice for organizations with different data formats and sources and/or a large amount of data to process.

How Presto Works With Zeppelin

With Presto, you can query structured and unstructured data at once inside one Zeppelin note.

To integrate Presto with Zeppelin and across other data sources, you will need to add Presto as a new interpreter in Zeppelin. This can be done without any code, as Zeppelin has an interface feature for it: you add a JDBC interpreter configured for Presto. Then, simply start writing a SQL query in a new note, preceding it with the %jdbc(presto) directive to tell the note which interpreter to use. 

Since Presto uses SQL, you won’t have to invest time into learning another niche programming language. Thus, speed, scalability, and familiarity make Presto a popular tool for real-time analytics on Zeppelin.

Trifacta and Presto

What Is Trifacta?

Trifacta is a cloud-based data engineering platform that is used to leverage an organization’s data. It enables data collection and consolidation, preparation, data transformation, and creation of data pipelines to be performed easily. These data pipelines can then be used for different use cases like business intelligence (BI), machine learning, and data analytics. Organizations using Trifacta include EMC, Deloitte, Intel, Google, the Centers for Disease Control (CDC), and Ericsson.

Trifacta is a cloud-native solution meant to make it easy to use data in different formats and structures for tasks like data analysis and machine learning. It achieves this by simplifying data cleaning, data exploration, data preparation, data validation, and the creation of data pipelines through a visual interface. It is supported on different cloud platforms including AWS, Google Cloud, and Microsoft Azure.

image from trifacta.com

Trifacta is meant to be used to create data engineering pipelines in a cloud environment. This offers a chance to both accelerate data engineering and create highly scalable solutions. One way this is achieved is by using various supported cloud storage platforms such as GCP Cloud Storage, AWS S3, and Databricks. The other is by using various data management platforms like Hadoop, Hive, Pig, Spark, and Presto.

What Is Presto?

Presto is an in-memory distributed query engine that is meant to provide a way to connect to disparate data sources like cloud storage solutions, SQL databases, and NoSQL databases. It was developed by Facebook to enable the social media company to query its Hive data warehouse. Since then, it has been adopted by other companies including Netflix, Twitter, Slack, Amazon, and FINRA.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by adding worker nodes. This has made it an efficient choice for organizations with different data formats and sources and/or a large amount of data to process.

Creating Distributed Data Engineering Solutions Using Trifacta and PrestoDB

The performance of data engineering solutions developed using Trifacta affects the performance of the apps that rely on the provided data pipelines. The solutions therefore must perform well and scale to meet the demands placed on them, responding quickly to changes in user demand. This calls for the adoption of a highly customizable distributed architecture. Combining Trifacta and Presto enables organizations to create highly scalable, distributed, and parallel data engineering solutions.

A typical architecture consists of a Presto cluster to handle data management and Trifacta applications to handle data wrangling tasks and provision data pipelines. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Trifacta applications. Integrating them offers other benefits such as data federation, fast query processing, and the ability to run different clusters, each optimized to best meet the needs of its data engineering apps.

Tableau and Presto

What Is Tableau?

Tableau is a data visualization tool that can query a wide range of data sources like relational databases, cloud databases, spreadsheets, data warehouses, and much more.

Tableau allows people to see and understand their data and provides business intelligence. Users can combine data from various data sources for analysis in dashboards. You can also manage Tableau in your own environment or with a managed service. It offers an interface suitable for both visualization beginners and experts.

image from tableau.com

Tableau is capable of creating visualizations with sophisticated data analysis. Apart from building charts, Tableau helps to uncover deeply hidden relationships in the data. It also allows you to do calculations right inside the visualization layer, which makes it easier to adjust on the fly.

Consequently, Tableau is used regularly by academics and business professionals in different fields. Data scientists use Tableau to get insights on data spread across their business. Engineers use Tableau to quickly plot data and automate reporting for other business units. Business analysts create self-service reports for top managers.

Paired with Presto, a distributed SQL query engine, Tableau can become an even more powerful tool.

What is Presto?

Great analytics requires great amounts of data and a speedy query engine. This is where Presto comes into play. 

It is called a distributed query engine because it spreads query processing across many nodes, and it lets you fetch data from multiple sources inside the same query. Furthermore, Presto works with both relational and non-relational databases. Even petabytes of data won’t slow it down or prevent it from delivering results within seconds.

Presto is open source and hosted by the Linux Foundation’s Presto Foundation. Anyone can download it from its GitHub repository and use it.

A few things make Presto stand out. More often than not, to query your raw data, you would need to move it to a new warehouse first. Presto works in-memory, querying the data directly where you store it. It creates an abstraction layer to transform data on the fly and dispatch it to you.

Presto integrates with Hadoop, a framework for distributed data processing, and can query data stored in HDFS, but it is not tied to Hadoop. Presto’s architecture is close to that of a massively parallel processing (MPP) database management system, but is more advanced. 

Presto has multiple worker nodes that are orchestrated by a coordinator node. The worker nodes fetch your data across sources, and the coordinator plans the execution of your query and delivers the results.

How Tableau and Presto Work Together

Traditionally, Business Intelligence tools such as Tableau are optimized for structured data queries against sources such as data warehouses. These BI tools were unable to support big data sources efficiently, which made for a slow (and sometimes expensive) process with a lot of manual workflows.

It’s also a manual process to connect Tableau to each of your data sources, and you have to manage every connection. If you connect Tableau to Presto instead, Presto can reach all of your data sources with a single query, without the need to manage all of those connections. Performance is much better as well, especially when you have many users.

You can connect to Presto from the Connectors menu directly or by using an SQL statement. Once connected, you will be able to access live Presto data within Tableau. You can also publish completed workbooks with Presto data.

Presto and AWS S3

What is AWS S3?

Amazon Simple Storage Service (Amazon S3) is storage for the internet. Amazon S3 is used to store and retrieve any amount of data at any time, from anywhere on the web, by using the web interface of the AWS Management Console.

What is Presto?

PrestoDB is a federated SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. Many organizations are adopting Presto as a single engine to query against all available data sources. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. 

Why use Presto with AWS S3?

Analysts get better performance at a lower cost by using S3 with Presto, as users can scale their workloads quickly and automatically. Presto allows users to quickly query both unstructured and structured data. Presto is an ideal workload in the cloud because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes, without needing to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.

Presto executes queries over data sets that are provided by plugins known as Connectors. Integrating Presto with S3 provides users with several features:

  • Presto, running on Amazon EMR, allows developers and analysts to easily run interactive SQL queries that directly access data stored in Amazon S3 for data exploration, ad hoc analysis, and reporting.
  • The Hive connector allows Presto to query data stored in S3-compatible object storage and registered in a Hive Metastore (HMS). 
  • Data transfer between a Presto cluster and S3 is fully parallelized.
  • Presto can be easily deployed using the AWS Serverless platform, with no servers, virtual machines, or clusters to set up, manage or tune. 

Since Presto is based on ANSI SQL, it’s very straightforward to start using it. The Presto connector architecture enables federated access to almost any data source, whether a database, data lake, or other data system. Presto can start from one node and scale to thousands. With Presto, users can run ad hoc SQL queries whenever they want, wherever their data resides. Presto allows users to query data where it’s stored, so they don’t have to ETL data into a separate system. With an Amazon S3 connector, platform teams can simply point to their data on Amazon S3, define the schema, and start querying using the built-in query editor or their existing Business Intelligence (BI) tools. With Presto and S3, you can mine the treasures in your data quickly and use your data to acquire new insights for your business and customers.
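As an illustration, here is a minimal sketch of running such a query from Python with the pyhive client (one of several Presto clients, and an assumption here). It assumes a running Presto cluster whose hive catalog is backed by a Hive Metastore or AWS Glue, with a table already defined over files in S3; the hostname, table, and column names below are hypothetical.

    # Minimal sketch: querying an S3-backed table through Presto's hive catalog.
    # Assumes the pyhive package is installed and that hive.default.events is
    # already registered in the Hive Metastore / Glue over files in S3.
    from pyhive import presto

    conn = presto.connect(
        host="presto.example.com",   # hypothetical coordinator hostname
        port=8080,                   # default Presto HTTP port
        username="analyst",
        catalog="hive",              # catalog backed by the metastore over S3
        schema="default",
    )

    cur = conn.cursor()
    # Presto reads the underlying S3 files in parallel across the workers.
    cur.execute("""
        SELECT event_type, count(*) AS event_count
        FROM events
        WHERE event_date = DATE '2021-03-01'
        GROUP BY event_type
        ORDER BY event_count DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)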

For users who are ready to use Presto to query their AWS S3 but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and has out-of-the-box integrations with AWS S3, in addition to AWS Glue and Hive Metastore (HMS).

You can try Ahana Cloud out free for 14 days, sign up and get started today.

Presto and Amazon Redshift

What is Amazon Redshift?

Amazon Redshift is a cloud data warehouse application service used by data analysts or data warehouse engineers for analyzing data using standard SQL and your existing Business Intelligence (BI) tools. Users can start with just a few hundred gigabytes of data and scale to a petabyte or more.

What is Presto?

Presto is a federated SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. Many organizations are adopting Presto as a single engine to query against all available data sources. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. 

Why Presto and Amazon Redshift?

Analysts get better performance at a lower cost by using Amazon Redshift with Presto, as users can scale their workloads quickly and automatically. Presto allows users to quickly query both unstructured and structured data. Presto is an ideal workload in the cloud because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes, without needing to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.

Presto executes queries over data sets that are provided by plugins known as Connectors. Integrating Presto with Redshift provides users with new capabilities:

  • Presto reads data directly from HDFS, so you don’t need to perform ETL on the data. Presto has also been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Redshift.
  • The Redshift connector allows users to query and create tables in an external Amazon Redshift cluster. Users can join data between different systems like Redshift and Hive, or between two different Redshift clusters. Since Presto on Amazon EMR supports spot instances, the total cost of running a data platform is lower.
  • Presto can reduce query execution time. Presto provides the ability to run queries in seconds instead of hours, and analysts can iterate quickly on innovative hypotheses with the interactive exploration of any dataset, residing anywhere.

Since Presto is based on ANSI SQL, it’s very straightforward to start using it. The Presto connector architecture enables federated access to almost any data source, whether a database, data lake, or other data system. Presto can start from one node and scale to thousands. With Presto, users can run ad hoc SQL queries whenever they want, wherever their data resides. Presto allows users to query data where it’s stored, so they don’t have to ETL data into a separate system. With the Redshift connector, platform teams can quickly provide access to the datasets analysts are interested in while being able to scale and meet the growing requirements for analytics. With Presto and Redshift, you can mine the treasures in your data quickly and use your data to acquire new insights for your business and customers.
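To make the federated pattern concrete, here is a minimal sketch using the pyhive Python client (an assumption; any Presto client works). It assumes catalogs named redshift and hive are already configured on the cluster; the hostname, table, and column names are illustrative only.

    # Minimal sketch: joining a Redshift table with a Hive/S3 table in one
    # Presto query. Catalog, table, and column names are illustrative.
    from pyhive import presto

    conn = presto.connect(host="presto.example.com", port=8080,
                          username="analyst", catalog="redshift", schema="public")
    cur = conn.cursor()
    cur.execute("""
        SELECT o.customer_id, c.segment, sum(o.amount) AS total_spend
        FROM redshift.public.orders AS o
        JOIN hive.default.customer_segments AS c
          ON o.customer_id = c.customer_id
        GROUP BY o.customer_id, c.segment
    """)
    print(cur.fetchall())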

Qlik and Presto

What Is Qlik?

Qlik is a platform providing data visualization and BI for end-to-end analysis of data. It offers interactive and user-customizable data visualizations, data federations, data analytics, and business intelligence (BI). Qlik enables its users to use data from different sources to carry out interactive data analysis and visualization. Organizations can leverage the available data for day-to-day decision-making and daily operational processes, promoting a data-driven culture. Some of the companies using Qlik include Samsung, Toyota, HSBC, Merck, and Shell.

Qlik comes in two versions, a Windows application and a cloud offering, and it can be deployed on multiple cloud environments. It provides an association between data that may reside in different data sources and carries out analytics in-memory for improved performance. This provides fast analytics and a unified view of an organization’s data.

The Architecture of a Qlik App

A Qlik app consists of data source, back-end, and front-end components. The data source end is responsible for handling data access between the app and the various data sources used. The back-end component contains the Publisher, which is responsible for pulling data from the different sources and providing the view files to a Qlik Server. It is mainly used by software developers, system analysts, database administrators, and other technical personnel.

image from qlik.com

The front-end accesses the views provided by the server and presents them as an interactive web app. It leverages the associative model used by Qlik to make selections, the main way of carrying out data analysis. The main users are organizational employees in different departments including marketing, sales, finance, management, and logistics.

What Is Presto?

Presto is a query engine originally developed by Facebook to quickly access the social media platform’s huge amounts of data. It is meant to connect to different data sources and to perform queries against them in a parallel and distributed manner. Presto is an open source project, housed by The Linux Foundation’s Presto Foundation.

Leveraging Distributed Computing with Qlik and Presto

For production-level data analytics and visualization, it is critical to use a distributed model to process huge amounts of data. This often requires the use of distributed and/or cloud computing solutions. By using Presto, an organization can create clusters that serve Qlik applications targeted at different use cases. These use cases include functions such as data aggregation, data analysis, ad-hoc querying, and data visualization. It also helps an organization develop a unified view of its data: Qlik apps only have to connect to one source while still being able to query multiple data sources, because Presto abstracts the different data sources away from the apps.

The common approach is to deploy different Presto clusters for different use cases. The Qlik apps then connect to the clusters, relying on them to perform the actual data querying. Presto is responsible for handling the queries from the apps in a parallel and distributed manner. Being an in-memory distributed query engine, PrestoDB can process large amounts of data in very short periods of time. The use of a cluster architecture means that horizontal scaling can be carried out easily and efficiently.

Presto and MySQL

What is MySQL?

MySQL is a popular open-source relational database management system (RDBMS) that data engineers use to organize data into tables. MySQL supports foreign keys to create relationships between tables. You can create, modify and extract data from the relational database using SQL commands.

MySQL is easy and intuitive to use; you only have to know a few SQL statements to use it, like SHOW DATABASES, CREATE DATABASE, USE, CREATE TABLE, SHOW TABLES, DESCRIBE, INSERT INTO … VALUES,  SELECT, WHERE, DELETE, ALTER TABLE…ADD, ALTER TABLE…DROP, and SHOW CREATE TABLE. 

MySQL has a client-server architecture, which means that many clients can communicate with the MySQL server to run queries, create tables, and so on. You can scale with ease; MySQL is fully multithreaded and handles large data volumes. It runs on more than 20 platforms, including Linux, Windows, UNIX, and macOS. Transactions can be committed, and their statements undone with the ROLLBACK statement. It is memory-efficient, and you can partition your data, which improves query performance.

MySQL Use Cases

Data engineers don’t typically use MySQL to run queries against massive data volumes. Instead, more common use cases might include: 

•      Small web-based databases: an eCommerce store, a web-page database, and an app database (with a small data volume)

•      Online analytical processing and online transaction processing systems (OLAP/OLTP) (with small data volume)

•      Building a business intelligence tool that doesn’t have a massive data volume

If you want to query big data with your MySQL database, you can use PrestoDB.

What is PrestoDB?

PrestoDB is an open source SQL query engine used to query data lakes and other data sources, without having to move your data. Presto runs queries directly against the data where it lives, including in databases like MySQL, and it can join data from multiple data sources like MySQL, Hadoop, Cassandra, etc. in a single query.

Using MySQL and Presto together

Data engineers commonly use MySQL and PrestoDB together for joining different systems like Hive, Cassandra, and MySQL and running queries across those data sources; it’s common to use partitioning to run those queries efficiently.

You can use MySQL and PrestoDB together to run queries against a massive data volume. PrestoDB queries the data managed by MySQL in place, through its MySQL connector.

PrestoDB has a MySQL connector to run queries and create tables in your MySQL database. You also can join different MySQL databases or different systems like MySQL and Hive. 

PrestoDB maps the databases and tables in the MySQL storage engine to its own schemas and tables. (Separately, data engineers often use the Hive metastore service as Presto’s metadata catalog, and that metastore frequently keeps its catalog in a MySQL database.)

To mount the MySQL connector, you create a catalog properties file named “catalog_name.properties” in “etc/catalog”. This file must set the connector.name, connection-url, connection-user, and connection-password properties. You can have as many catalog files as you want.

In PrestoDB, schemas contain database objects like tables, so there is one schema for each MySQL database. To create, modify, and extract data in your MySQL database, you can use statements like the following (some run in MySQL itself, others through PrestoDB); a connection sketch follows the list:

  • USE dbname: selects the database
  • CREATE DATABASE dbname: creates a database
  • CREATE TABLE table_name: creates a table
  • LOAD DATA ... INTO TABLE table_name: loads rows into a table from a file
  • SHOW SCHEMAS FROM catalog_name: shows all databases in that catalog
  • SHOW TABLES FROM catalog_name.dbname: shows all tables in the database (dbname)
  • DESCRIBE catalog_name.dbname.table_name: shows the list of columns in the table (table_name) in the database (dbname)
  • SHOW COLUMNS FROM catalog_name.dbname.table_name: shows the list of columns in the table (table_name) in the database (dbname)
  • SELECT * FROM catalog_name.dbname.table_name: returns the contents of the table (table_name) in the database (dbname)
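As a sketch of how this comes together, the following Python snippet (using the pyhive client, which is an assumption; any Presto client works) connects to a hypothetical MySQL catalog mounted as described above and runs a metadata statement plus a cross-catalog join. All catalog, schema, and table names are illustrative.

    # Minimal sketch, assuming a MySQL catalog was mounted via
    # etc/catalog/mysqldb.properties with connector.name, connection-url,
    # connection-user, and connection-password set as described above.
    from pyhive import presto

    conn = presto.connect(host="presto.example.com", port=8080,
                          username="analyst", catalog="mysqldb", schema="shop")
    cur = conn.cursor()

    # List the MySQL tables Presto can see through the catalog.
    cur.execute("SHOW TABLES FROM mysqldb.shop")
    print(cur.fetchall())

    # Join a MySQL table with a Hive/data lake table in a single query.
    cur.execute("""
        SELECT c.name, sum(s.quantity) AS units
        FROM mysqldb.shop.customers AS c
        JOIN hive.default.sales AS s
          ON s.customer_id = c.id
        GROUP BY c.name
    """)
    print(cur.fetchall())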

If you’re ready to use Presto to query MySQL but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and can integrate with MySQL.

Get started with a 14-day free trial of Ahana Cloud.

Presto and PostgreSQL

What is PostgreSQL?

PostgreSQL is an enterprise-grade open source database management system (DBMS) that is free and offers advanced functionality common to more expensive DBMSes. It is an object-relational database that supports multiple programming languages, custom data types, JSON, relational modeling, and table partitioning. Some of the companies using PostgreSQL in production include Instagram, Apple, Cisco, and Red Hat.

PostgreSQL’s wide usage and support are the result of several factors. Being open source, the database is free and benefits from contributions from a large developer community. Because it is so widely used, many talented and experienced people are already conversant with it, so there is no shortage of skilled personnel who can implement and maintain data management systems powered by PostgreSQL. Features such as table partitioning, replication, and geospatial data support mean it offers capabilities normally found only in closed source enterprise DBMSes that are expensive to acquire and run.

PostgreSQL is highly extensible and scalable and is used to manage massive datasets reliably. Support for foreign data wrappers ensures that it integrates easily with data streams and other databases. This makes it ideal for serving as the data storage system for online analytical processing (OLAP), advanced analytics, and BI solutions. Installation is simple, and it is free to download. Additionally, free Docker containers and paid cloud offerings exist to support a cloud-based microservices architecture for even better horizontal scalability of the developed systems.

What Is Presto?

Presto is a query engine used to query different data stores in differing formats in a distributed and parallel manner. It was originally developed by Facebook to query its data warehouse built on Apache Hive. 

Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Leveraging PostgreSQL and Presto

Combining PostgreSQL and PrestoDB offers many benefits to developers and organizations. Advanced database features like geospatial data support, full-text search, and support for different programming languages, combined with being open source, make PostgreSQL highly customizable. Developers are therefore free to configure it as they deem best for their systems. Pairing it with Presto enables even better solutions, as Presto is also highly customizable, open source, and distributed. Worker nodes can integrate easily with multiple nodes running the database, as Presto ships with a free PostgreSQL connector.

PrestoDB uses this connector to talk to PostgreSQL. The connector is responsible for generating SQL statements that are compatible with the database, and worker nodes use it to issue commands and fetch data. By pushing down as much work as possible to the database, Presto is able to leverage the power and features of PostgreSQL. This makes them a great combination for developing big data solutions. Some of the use cases of Presto and PostgreSQL are machine learning, data analysis, data visualization, and business intelligence.
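For illustration, here is a minimal sketch of querying PostgreSQL through Presto with the pyhive Python client (an assumption; any Presto client works). The catalog name postgresql, the hostname, and the table and column names are assumptions, and the filter shown is the kind of work the connector can push down to the database where supported.

    # Minimal sketch, assuming a PostgreSQL catalog named "postgresql" is
    # configured on the cluster; schema, table, and column names are
    # illustrative only.
    from pyhive import presto

    conn = presto.connect(host="presto.example.com", port=8080,
                          username="analyst", catalog="postgresql", schema="public")
    cur = conn.cursor()
    cur.execute("""
        SELECT region, count(*) AS order_count
        FROM postgresql.public.orders
        WHERE order_date >= DATE '2021-01-01'   -- a filter the connector can push down
        GROUP BY region
    """)
    print(cur.fetchall())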

If you’re ready to use Presto to query PostgreSQL but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and can integrate with PostgreSQL.

Get started with a 14-day free trial of Ahana Cloud.

Looker and Presto

What is Looker?

Looker is a business intelligence (BI) platform that helps individuals, businesses, and organizations get better insights from available data. Acquired by Google in 2019 and now part of GCP (Google Cloud Platform), it helps users explore data, find meaningful trends and patterns, and visualize them. It is available on GCP and AWS as a cloud solution and can also be used to create custom solutions as needed. Looker can connect to different databases including BigQuery, MySQL, PostgreSQL, Oracle, PrestoDB, and Vertica, and supports more than 50 different SQL dialects.

Looker users include software developers, data scientists, marketers, and management, with use cases spanning business intelligence, data analytics, and data-driven decision-making across different data sources. Its power lies in being able to interact with different data sources to create a holistic view of the available data for different users and departments. This makes it easy to manage workflows across a whole project or organization in a data-driven manner.

Since users do not have to learn SQL, everyone in a project or organization can easily customize their reports, dashboards, heatmaps, and other BI presentation assets. Furthermore, Looker goes beyond rudimentary BI to include support for fresh data, machine learning, performance monitoring, and ETL optimizations.

How Looker Works Under the Hood

Looker is a cloud-based platform that can be deployed to offer advanced BI. It works by using LookML to generate SQL code that is sent over a connection to a given database. This allows Looker to connect to different databases for business intelligence purposes.

A Looker project consists of one or more models that provide a connection to a database. A model can contain one or more Explores that provide an interactive webpage where a user can dynamically work with the data. 

image from looker.com

Explores contain views that map to the tables in the given database. Users are then able to issue queries to the database, filter their results, and visualize them in a variety of ways.

What is Presto?

Presto is a distributed query engine that allows in-memory processing of data from disparate data sources. It was developed by Facebook and is in use in other large companies including Netflix, Airbnb, LinkedIn, and Amazon for their data management requirements. 

Using Looker With Presto

Looker is an extremely powerful BI platform for a front-end facing app. Combining it with a Presto-based back end can cut query times to seconds rather than minutes. To create a Looker and Presto system, you first deploy your Presto cluster, then create a connection in Looker, selecting PrestoDB as the dialect.

Looker then relies on Presto to carry out query processing. The queries run in a distributed manner and provide access to multiple data sources. This also provides a single source of truth with regard to the overall data model. This kind of configuration is used to build highly scalable, enterprise-level BI solutions, with Looker handling BI and Presto handling distributed data querying.

Jupyter and Presto

What is Jupyter?

Jupyter Notebook is open-source software for creating shareable documents and reports. Each such document can easily host a whole working data pipeline. The .ipynb files can contain code chunks, free and markdown-formatted text, and even visual objects for data analysis.

Jupyter became particularly popular among data scientists and engineers using Python because it works with most Python packages. A Jupyter Notebook can quickly digest and process significant amounts of data, and you just need a couple of command lines to install it using a command shell. 

While Jupyter has no drag-and-drop elements, it still offers a few simple no-code features. The software is called “notebook” due to its interface. In a cell, you can type in a chunk of code, run it, and see the results printed to a small console underneath. Then you add another chunk and repeat the operation, getting the results appended under the second chunk. Since you have your code and all outputs in one place, ordered chronologically, it is very similar to making notes with a paper block.

image from jupyter.org

Jupyter works well for data analysis or for drafting and testing data pipelines. If you need some data insights and a report based on them, you can add your comments and annotations directly into the notebook, run the kernel, and export the whole document as an HTML file to be sent along. You can also build a data pipeline, load and process your data, and export it to a database, to an Excel file, or via one of the many other methods available. Once you are satisfied that it runs smoothly and delivers adequate results, you can deploy the pipeline in your main system.

Jupyter has a fairly steep learning curve. Fortunately, its contributors have put together thorough documentation to help new users and enable troubleshooting.

What is Presto?

Presto is an open source SQL query engine used to query data lakes and other data stores in differing formats in a distributed and parallel manner. It was originally developed by Facebook to query its data warehouse built on Apache Hive. 

The Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

How Jupyter and Presto Work Together

You can use Presto with Jupyter Notebook to expand your data pools.

To connect to your Presto engine, you can use the Presto connector from the pyhive package. All Python-based connectors use similar logic: in the case of Presto, you enter your host and port, and then add your catalog and schema. After the connection has been established, you can run SQL queries to fetch the data.
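Here is a minimal sketch of such a connection from a notebook cell; the host, catalog, schema, and table names are placeholders for your own cluster, and pandas is used only to make the result easy to explore.

    # Minimal sketch: connecting a Jupyter notebook to Presto with pyhive.
    import pandas as pd
    from pyhive import presto

    conn = presto.connect(
        host="presto.example.com",   # your coordinator host (placeholder)
        port=8080,                   # default Presto HTTP port
        username="analyst",
        catalog="hive",
        schema="default",
    )

    cur = conn.cursor()
    cur.execute("SELECT * FROM orders LIMIT 100")   # table name is a placeholder

    # Load the rows into a DataFrame for exploration in the notebook.
    df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
    df.head()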

With Presto, you can query both relational and non-relational databases and other data sources, and you can combine data from more than one data source inside a single query. Presto is also performant: calculations on terabytes of data take minutes. Presto sends one query to multiple worker nodes via a coordinator node, distributing the workload; the coordinator collects the results once they're ready and dispatches them to you.

Ahana Welcomes Satish Ramakrishnan as VP, Engineering to Senior Management Team

Ramakrishnan brings deep SaaS and data management expertise to Presto managed service to lead engineering and technical operations

San Mateo, Calif. – April 7, 2021 – Ahana, the self-service analytics company for Presto, announced today the appointment of Satish Ramakrishnan as VP, Engineering, reporting to Steven Mih, Cofounder and CEO. He will head up engineering and technical operations with a focus on advancing Ahana’s vision of simplifying ad hoc analytics for organizations of all shapes and sizes.  

“Satish’s appointment is an important and high-impact addition to the senior team as we expand the capabilities of the industry’s first Presto managed service,” said Mih. “As vice president of engineering, Satish will be integral to helping build out Ahana Cloud for Presto to bring the power of the most powerful open source distributed SQL query engine to any organization. He brings extensive management, technical and organizational leadership to Ahana and has an excellent track record of building high-performing teams and data management services that deliver incredible value.”

“I look forward to helping deliver on Ahana’s vision of bringing the power of Presto to data platform teams of all sizes,” said Ramakrishnan. “From my first meetings with the Ahana team, I knew this was an incredible opportunity to be involved in one of the most popular analytics engines today. I am excited to continue building the managed service platform that delivers on and exceeds customers’ expectations.” 

Satish brings over 22 years of executive experience to Ahana with deep expertise in developing data-focused SaaS services on public cloud providers, including AWS and Google Cloud Platform. Prior to Ahana, Satish was the EVP Engineering at Reltio running engineering and technical operations, where he managed the continued innovation and stabilization of their market leading Master Data Management platform while ensuring revenue growth and operating margin improvement. He has held VP positions at startups like RiseSmart, PointCast, CoreObjects, etc. as well as at large enterprises like BEA and Comcast. At Comcast, he founded the Silicon Valley Innovation Lab where he took an idea from a piece of paper all the way through launch, which became part of the Comcast “Triple Play” offering. Satish holds a B.E. in Computer Science from Mysore University.

Supporting Resources:

Download a head shot of Satish Ramakrishnan. 

Tweet this: @AhanaIO welcomes Satish Ramakrishnan as #VP #Engineering #Presto #PrestoDB #Cloud #Data #OpenSource #Analytics https://bit.ly/3fAXFdj

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What are the AWS Glue partition limits and does it apply to AWS Athena?

Typically you’ll use AWS Glue to create the data sources and tables that Athena will query. You could also create your tables using Athena and then use them in AWS Glue. Today the limit for AWS Glue is 10 million partitions per table, while Athena's own catalog is limited to 20,000 partitions per table. So if you use the AWS Glue catalog with your Athena deployment, you get more partitions. 

When it comes to using Amazon Athena, there are a lot of other limits besides partitions, including query, concurrent query, and database limits. Many AWS Athena users find that these limits hinder SLAs and other business-critical applications. As a result, they look for alternatives to AWS Athena and have found that Ahana Cloud’s Presto managed service can address those issues.

Ahana Cloud is a Presto managed service that gives users the power of Presto, similar to AWS Athena, but without the limitations around queries, performance, and control. With Ahana you get full control of your Presto deployment on AWS, so you can scale as you need without having to worry about partition or query limits.

Learn more about how Ahana Cloud compares to AWS Athena.

Ready to get started? Sign up to try out Ahana Cloud for free – if you’re an AWS Athena user, it’s really easy to move from Athena to Ahana Cloud, and it requires no change to your existing architecture.

Picking the right approach for Presto on AWS:
Comparing Serverless vs. Managed Service

Video
Slides

Ahana Announces New Capabilities for Its Presto Managed Service Including Data Lake Caching, Security and Seamless Operations

Caching advancements improve query performance up to 5x for complex workloads making Ahana the most advanced cloud service for Presto

San Mateo, Calif. – March 24, 2021 – Ahana, the self-service analytics company for Presto, announced today at PrestoCon Day significant updates to its Ahana Cloud for Presto managed service. The major areas of focus include performance, better cluster management, ease of use, and security. One of the key features is the data lake IO caching capability that can dramatically improve query performance, reducing latencies up to 80%. 

“Our latest innovations make Ahana Cloud the fastest and most advanced cloud native managed service available for Presto,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “We have seen that advanced capabilities like data lake IO caching can improve query performance up to 5x for real-world concurrent query workloads. This, along with our new ease of use, security and cost management advancements, continue to extend Ahana’s leadership in the market, giving data platform teams more value using the power of Presto for faster insights, easier access, and lower costs.” 

“We knew we wanted to use Presto to query all of our AWS data sources, but we found it challenging to manage it on our own in the cloud,” said Omar Mohamed, Data Engineer, Cartona. “Ahana Cloud made it possible for our team to deliver real-time analytics with Presto without having to worry about the complexities. We look forward to doing even more with Ahana’s latest set of capabilities.”

Ahana Cloud for Presto features new capabilities, including:

Performance improvements

  • Data lake IO caching – Ahana Cloud for Presto is the only cloud-native managed service for Presto, and with the new data lake IO caching capability it is now also the only Presto-plus-caching service for users. Data lake IO caching, built on the RubiX open source project, eliminates the need to repeatedly read data from data lakes like AWS S3 when the same data is read over and over. Today, Ahana Cloud users can enable data lake caching with the click of a button when creating a new Presto cluster. The rest, including attaching SSDs and sizing the cache based on the user-selected instance type, is done automatically. Ahana has also open sourced this feature for the Presto community, and it will be available in the upcoming PrestoDB release.
  • Pre-tuned Presto cluster capability – New clusters now come pre-tuned with even more advanced features enabled, including the Cost-based Optimizer and Dynamic Filtering Capabilities which can improve performance, particularly for complex queries, as well as memory tuning to enable best out-of-the-box performance of Presto.

Ease of use – A range of major improvements have been added that make Ahana Cloud the easiest service for Presto on AWS. 

  • Automated and versioned upgrades of Ahana Compute Plane – Versioning and upgrade infrastructure brings users disruption-free upgrades – new features are automatically made available to users. 
  • Ability to deploy different instance types for Presto coordinator and workers, as well as Ahana-managed Hive Metastore Catalog – This brings users flexibility to use Presto for a range of workloads and customize resources provisioned as needed. 
  • Customized IP block allocation for the Ahana-managed Virtual Private Cloud – Allows users with more restrictive networking environments to use Presto in the cloud
  • Support for asynchronous query execution with Apache Superset in the Ahana Compute Plane – Ahana manages Apache Superset so that administrators can query and validate that clusters are available for their data teams. Now they can run long-running queries with Apache Superset as well. Ahana manages and automates the deployment of the infrastructure required for these queries to run asynchronously, reducing the time platform teams spend on infrastructure.  
  • One click creation of AWS provisioning roles and policies needed with an AWS CloudFormation Template – This dramatically simplifies setting up the initial cross-account access between Ahana and the user’s AWS account. 

Security

  • Significantly tighter policies and advanced AWS tagging for all Ahana-managed resources in the user’s account – Ahana’s managed service is deployed in the user’s AWS account, which enables compute to be deployed where the data is without the need for customer data to ever touch Ahana Cloud. With the latest improvements, the policies needed for Ahana management of the compute plane are even tighter and each resource managed by Ahana on behalf of the user is explicitly tagged. This makes the Ahana environment completely isolated and independent from any other user-managed resources.
  • Integration of Ahana Compute plane with AWS CloudTrail, including log validation – AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. Now with Ahana’s integration with CloudTrail, you can log, continuously monitor, and retain Ahana-administered account activity and actions across your AWS infrastructure. 

Other new upcoming capabilities in Ahana Cloud for Presto, include:

  • Advanced security with the Apache Ranger integration – You can now use your centralized security administration for fine-grained access control to your Presto clusters. This capability has also been contributed to Presto open source, making it the first open integration for Presto and Apache Ranger. 
  • Better cost management for Presto clusters – Your Presto cluster automatically reduces the number of worker nodes to a single node when the query engine is idle for a certain time. In addition, Ahana can scale clusters up and down based on CPU utilization.
  • Programmatic access to cluster management – Create, start, stop and resize your Presto clusters using our REST API. 

“The convergence of the data warehouse and data lake continues to be a growing trend, and Presto is one of the few technologies that enables ad hoc analytics for a unified analytics approach,” said John Santaferraro, Research Director, Enterprise Management Associates. “Based on interviews, EMA sees the Presto open-source community as vibrant and actively accelerating the value of open source software. Ahana Cloud makes Presto accessible to data platform teams of all sizes so they can use SQL to analyze all of their different data stores.”

Together, these new capabilities help solidify Ahana Cloud’s position as the only cloud-native managed service for Presto and give platform teams of all sizes more capabilities to easily deploy Presto at scale in the cloud.

Supporting Resources:

Tweet this: @AhanaIO announces a set of new capabilities for its Ahana Cloud for Presto managed service #Presto #PrestoDB #Cloud #Data #OpenSource #Analytics https://bit.ly/3tAqV7S

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

PrestoCon Day Showcases Momentum of Presto Community, Features Industry Leaders Sharing Use Cases and Best Practices

Popular session topics include the Presto and Apache Iceberg Connector by Twitter and the Presto, Today & Beyond panel with Ahana, Uber and Facebook

San Mateo, Calif. – March 17, 2021 – Ahana, the self-service analytics company for Presto, announced today its participation in PrestoCon Day, a day dedicated to all things PrestoDB taking place on Wednesday, March 24th. 

Following last year’s massively successful PrestoCon that had hundreds of global attendees and over 20 sessions, PrestoCon Day is a day-long event for the PrestoDB community by the PrestoDB community. In addition to being a premier sponsor of the event, Ahana will be participating in four sessions; two Ahana customers, Carbon and Cartona, will also be presenting their Presto use cases.

“PrestoCon Day will feature a lot of the work that core Presto engineers have been building over the past six months, and we’re looking forward to sharing that with the community,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana, Program Chair of PrestoCon and Chair of the Presto Foundation Outreach Committee. “In addition to the biggest internet companies like Facebook, Uber and Twitter speaking, we’ll also have modern day SaaS companies presenting how they’re leveraging the power of Presto on AWS with Ahana. We’re thrilled to host an event for the Presto community that helps members, new and seasoned, connect with one another.”

“Presto Foundation was created to accelerate and foster the adoption of Presto, and PrestoCon Day is an important event for the community to showcase its project momentum, usage and advancements,” said Chris Aniszczyk, Vice President, Developer Relations, The Linux Foundation. “We look forward to the conference and bringing the Presto community together for a day of collaboration and community-building.”

Ahana customers speaking at PrestoCon

  • How Carbon uses PrestoDB in the Cloud with Ahana to Power its Real-time Customer Dashboards by Jordan Hoggart, Data Engineer at Carbon
  • Presto on AWS using Ahana Cloud at Cartona by Omar Latif, Data Engineer at Cartona

Ahana sessions at PrestoCon

  • Simplifying Presto on AWS with Ahana Cloud Managed Service by Dipti Borkar, Cofounder & Chief Product Officer at Ahana
  • Using Presto’s BigQuery Connector for Better Performance and Ad-hoc Query in the Cloud by George Wang, Principal Software Engineer at Ahana and Roderick Yao, Product Manager at Google

Panels at PrestoCon 

  • The Presto Ecosystem with speakers from Preset, Apache Hudi, and Apache Pinot, moderated by Dipti Borkar, Cofounder & Chief Product Officer at Ahana
  • Presto, Today & Beyond featuring David Simmen, Cofounder and CTO at Ahana along with speakers from Uber and Facebook, moderated by Dipti Borkar, Cofounder & Chief Product Officer at Ahana

Additionally, industry leaders Facebook, Uber, Twitter and Alibaba will be sharing the latest innovation in the Presto project, including the Prism Presto Gateway Service (Uber), Petabyte Scale Log Analysis (Alibaba), Presto and Apache Iceberg (Twitter), and a Presto State of the Union (Facebook). 

View all the sessions in the full program schedule

PrestoCon Day is a free virtual event and registration is open. Additionally, Ahana is hosting a hands-on virtual lab the day prior to PrestoCon to help Presto users get up and running on AWS. 

Other Resources

PrestoCon Day registration

PrestoCon Day Program Schedule

Virtual Lab: Setting up your AWS S3 data lake and querying it with Presto

Tweet this: .@AhanaIO announces its participation in #PrestoCon Day #cloud #opensource #analytics #presto #prestodb BITLY

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Upsolver and Ahana Announce Technology Partnership to Accelerate Adoption of Open Data Lake Analytics in the Cloud with Presto

Todd Odess, Head of Partnerships at Upsolver | Ali LeClerc, Head of Marketing at Ahana

Today we’re thrilled to announce our joint technology partnership. Upsolver, the first no-code, high-speed cloud compute layer for making data lakes analytics-ready, and Ahana, the managed service for Presto, offer an end-to-end cloud-native solution for open data lake analytics. This solution gives anyone the ability to prepare data for analysis and query it. Upsolver’s visual UI lets users easily land and automatically optimize data in AWS S3, and then query it with Ahana Cloud for Presto.

Customers get to value quickly and affordably, without having to write and maintain Spark-based data pipelines or manage complex query engines. Over the long-term, they benefit from greater analytics agility, being able to productionize new use cases in days rather than weeks or months. 

Companies today collect and manage massive amounts of data – on AWS S3 alone there are 10,000+ data lakes. But studies show that companies only analyze 12% of their data; the rest is locked up in siloed systems, expensive on-premises solutions, and incompatible data formats. Open Data Lake Analytics helps users unlock that data. 

Open Data Lake Analytics is an approach to traditional data analytics that opens up a wider range of data for analytics in the cloud. It’s built on open source Presto, the distributed SQL query engine, and supports open formats (Apache Parquet, Apache ORC, and more) and open interfaces that integrate with existing SQL systems. It can also run on any cloud. 

Together, Ahana and Upsolver give organizations a cloud-native platform that analyzes large volumes of data at any speed – real-time or batch – automating the convoluted process of building cloud data lakes plus managing Presto. Working together enables data teams to manage complex pre-query data engineering and a distributed query engine without having deep technical knowledge. 

Upsolver provides a visual, SQL-based interface to easily create and manage tables, as well as deep integration with AWS S3, AWS Glue Catalog, and PrestoDB for high performance queries. 

We believe this partnership that provides open data lake analytics is critical to maximize the value of our customers’ data lakes. Together we accelerate time to value on data lake analytics projects and make it simple enough for any data practitioner to analyze any dataset.

This joint solution brings the power of technologies used at internet-scale companies like Facebook, Uber, and Twitter to data teams of any size. It helps companies expand beyond the cloud data warehouse to a multifaceted analytical stack, opening up a broad set of advanced use cases and greater flexibility.

We look forward to this new partnership with Upsolver and making the open cloud data lake accessible for everyone.

Additionally, Ahana and Upsolver are both part of the AWS Partner Network and premier members of the Linux Foundation’s Presto Foundation, a foundation that governs the community-driven PrestoDB project.

Ahana and Upsolver will be holding a free virtual hands-on workshop on March 23 to help users get up and running with the platform. Registration is now open.

The joint solution is available today; users can select Ahana as an output in Upsolver.

Send us a note if you’re interested in learning more! Looking forward to seeing all you do with Upsolver and Ahana 😀

How do I get deterministic performance out of Amazon Athena?

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. This approach is advantageous for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse.

What is Athena great for?

Users love Athena for the simplicity, ease of getting started, and the fact that no servers need to be managed. You only pay for the amount of data scanned. Athena is closely integrated with AWS Glue, so if you are already using the Glue catalog for your ETL workloads, all your tables are already defined and accessible. You can then make these tables available for interactive analytics to your extended team.

What are the trade-offs?

The simplicity of the deployment architecture of Amazon Athena, however, comes at the price of inconsistent performance at scale, as many users of Athena have already experienced.

There are two primary trade-offs with Athena. Firstly, transparency – “you get what you get” – you have no visibility into the underlying infrastructure serving your queries, nor do you have the ability to control or tweak that infrastructure for your workloads. This approach works for cases where performance/latency is not critical, but it can be a non-starter for users who need control. Secondly, despite all their advantages, shared serverless services have disadvantages related to performance predictability.

AWS has documented best practices in their performance tuning guide to get the most out of Athena and to avoid typical errors that users encounter, such as “Query exceeded local memory limit” and “Query exhausted resources at this scale factor”. These include using optimized formats such as Parquet and ORC, avoiding small files, and partitioning the data appropriately.

How does the serverless architecture impact my query performance?

Athena at its core is a shared serverless service per region – to protect against usage by a handful of customers spiraling out of control, Amazon has placed restrictions on the usage, size, concurrency, and scale of the service on a per-customer basis and within the region overall.

These limits and restrictions are well documented in the service limits, such as 20 active DDL and DML queries each in most regions, 30-minute maximum query timeouts, API limits, and throttling, among others (some of these limits can be tweaked by working with Amazon support). These limits are guard rails around the service so that the usage of one customer doesn’t adversely affect the experience of another. The guardrails are, however, far from perfect, since there are only a finite number of resources per region to be shared across the customers in that region. Any excessive load due to seasonal or unexpected spikes at Amazon’s scale can easily consume the shared resources, causing contention and queuing.

In addition to the shared query infrastructure, Athena’s federated query connectors are based on Lambda, which is again serverless. Lambda scales out well and can be performant once warmed up; however, consistent performance comes only with consistent use. Depending upon the usage of a particular connector in the region and the available capacity of the backend infrastructure, you could run into latencies caused by cold starts, especially if you are using connectors that are not accessed frequently, e.g. custom connectors.

If a large number of users end up using Athena at the same time, especially for large-scale queries, they often observe extended queuing of their queries. Though the query itself might not take long to execute once resources are available, the queueing significantly impacts the end-user experience for interactive query workloads. Users have also at times reported inconsistent execution times for the same query from one run to another, which ties back into the shortcomings of the shared resources model.

So can I get deterministic performance out of Athena? 

If your queries are not complex or latency-sensitive and your infrastructure is in a less crowded region, you may not encounter performance predictability issues frequently. However, your mileage depends entirely upon several factors: when you are running the query, which region you are running it in, the volume of data you are accessing, and your account service-limit configurations, to name just a few.

What are my options?

If your interactive query workload is latency-sensitive and you want to deterministically control the performance of your queries and the experience of your end-users, you need dedicated managed infrastructure. A managed Presto service gives you the best of both worlds: it abstracts away the complexity of managing a distributed query service while giving you the knobs to tweak the service to your workload needs.

Ahana provides a managed Presto service that can scale up and down depending on your performance needs. You can segregate workloads into different clusters or choose to share a cluster. You can also choose beefier infrastructure for business- and time-critical workloads and set up separate clusters for less critical needs. You make the choice of price, performance, and flexibility depending upon your business objectives.

Do I have to use AWS Lambda to connect to data sources with Athena?

The Athena Federated Query Journey

AWS announced the public preview of Athena federated query in November 2019 and moved the service to General Availability (GA) in November 2020. As of this writing, Athena offers two engine versions – engine 1, based on Presto 0.172, and engine 2, based on Presto 0.217. Among other features, the federated query functionality is only supported on engine 2 of Athena.

What connectors are supported?

Athena currently supports a number of data source connectors:

  • Amazon Athena CloudWatch Connector
  • Amazon Athena CloudWatch Metrics Connector
  • Athena AWS CMDB Connector
  • Amazon Athena DocumentDB Connector
  • Amazon Athena DynamoDB Connector
  • Amazon Athena Elasticsearch Connector
  • Amazon Athena HBase Connector
  • Amazon Athena Connector for JDBC-Compliant Data Sources (PostgreSQL, MySQL, and Amazon Redshift)
  • Amazon Athena Neptune Connector
  • Amazon Athena Redis Connector
  • Amazon Athena Timestream Connector
  • Amazon Athena TPC Benchmark DS (TPC-DS) Connector

In line with the serverless model of AWS Athena, and in order to maintain an impedance match between Athena and the connectors, the connectors are also serverless, based on AWS Lambda, and run within the region. These connectors are open-sourced under the Apache 2.0 license and are available on GitHub or can be accessed from the AWS Serverless Application Repository. Third-party connectors are also available in the Serverless Application Repository.

Athena is based on Presto – can I use the open source Presto data source connectors with Athena?

Presto has an impressive set of connectors right out of the box; these connectors, however, cannot be used as-is with Athena. The Presto service provider interface (SPI) required by the Presto connectors is different from AWS Athena’s Lambda-based implementation, which is based on the Athena Query Federation SDK.

You can use the Athena Query Federation SDK to write your own connector using Lambda or to customize one of the prebuilt connectors that Amazon Athena provides and maintains. The Athena connectors use the Apache Arrow format for returning data requested in the query. An example Athena connector can be found here.

If you’re looking for a managed service approach for PrestoDB, Ahana Cloud offers that based on open source Presto. Ahana’s managed deployment is 100% compatible with the open source Presto connectors, and you can get started with a click of a button.

How do I do geospatial queries and spatial joins in Presto?

A question that often comes up is “how do I do geospatial queries and spatial joins in Presto?”. Fortunately Presto supports a wealth of functions and geospatial-specific joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set up some test data in two tables. The first table is trips_table, which will store four rows, each representing a simple taxi trip. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (
trip_id int, 
orig_long double, 
orig_lat double, 
dest_long double, 
dest_lat double, 
trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast('2021-03-02 09:00:00 UTC' as timestamp));
insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast('2021-03-02 09:30:00 UTC' as timestamp));
insert into trips_table values (3, 48.858965, 2.293497,48.859952, 2.340328, cast('2021-03-02 09:45:00 UTC' as timestamp));
insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast('2021-03-02 10:45:00 UTC' as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table with each row storing the shape representing an area e.g. central London, and a name for the shape. We represent the shape with a sequence of coordinates that enclose a specific area: 

create table city_table (
geo_shape varchar, 
name varchar);

insert into city_table values ('POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))', 'London, central');
insert into city_table values('POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))', 'Los Angeles, downtown');

For information:

  • I used a simple triangle to represent downtown LA (see illustration below)
  • I used a rectangle representing central London. 
  • In each case the first pair of coordinates for the shape is the same as the last pair – so it’s an enclosed bounding box or polygon we’re describing.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coordinates using the POLYGON(()) WKT notation. The double parentheses are required.
Simple polygon to represent downtown LA 
(credit: “Geospatial Data Analysis using SQL”)

Now let’s run a query to count how many trips occurred in each city. We join our two tables, and we use each journey’s originating long/lat coordinates to determine – using ST_CONTAINS() – if that point exists in any of our shapes.  This function requires the polygon to be expressed as a special type – Geometry – so we convert our shape from text using ST_GeometryFromText() function:

SELECT c.name as City, count(*) as Trips 
FROM trips_table as t 
JOIN city_table as c 
ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat)) 
GROUP BY 1;

         City          | Trips 
-----------------------+-------
 Los Angeles, downtown |     1 
 London, central       |     2 
(2 rows)

We see both London trips made it into the result set, despite one of the trips ending at the airport, which is well outside the shape we defined for central London – this is because the query uses the originating coordinates for each trip, not the destination coordinates.

Also notice the Paris trip didn’t make it into the result – this is because we did not define a shape for Paris.
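
If you wanted to count trips by where they ended rather than where they started, the same pattern works with the destination coordinates. Here’s a quick variation using the two tables created above – with this data, the Heathrow-bound trip would drop out of the central London count because its destination lies outside the rectangle:

-- Count trips by destination city instead of origin city
SELECT c.name as City, count(*) as Trips 
FROM trips_table as t 
JOIN city_table as c 
ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.dest_long, t.dest_lat)) 
GROUP BY 1;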

In this example you’ve seen how straightforward geospatial queries and spatial joins are in Presto (the same SQL works in Ahana Cloud for Presto).

Presto’s Geospatial functions are listed in the Presto documentation.  

We have a ton of resources to help you get started with Presto, check them out here.

How do I query JSON documents with Presto?

JSON documents are a common data type. A lot of people collect logs and load them into S3. Querying that JSON with Presto can be challenging because you’re not always sure what the schema is – because it’s JSON, the schema can differ from document to document and won’t always be consistent. The great part about Presto is that it has functionality that enables you to get insights on your JSON data.

Presto has a JSON data type. The JSON data type is similar to a string. One way to think about it is that it is just a string containing JSON, except that the data inside the string has already been parsed – it’s already well-formed JSON.

But it’s also slightly different semantically, because JSON has maps. When you have a string containing a map, the keys of the map can appear in any order, but in the JSON data type the keys of a map are always kept in a consistent order. It’s worth noting that two JSON values containing maps with the same keys and values will compare as equal, even if the keys appeared in a different order in the original strings. This matters when doing an aggregation, grouping, or comparison: logically, two such maps are treated as the same value, no matter what the original order was.

How the JSON parse function works

The JSON parse function takes a JSON string and returns a JSON data type. Here is an example of the JSON data type:

SELECT json_parse('null'); -- JSON 'null'
SELECT json_parse('true'); -- JSON 'true'
SELECT json_parse('42'); -- JSON '42'
SELECT json_parse('"abc"'); -- JSON '"abc"'
SELECT json_parse('[1, 2, 3]'); -- JSON '[1,2,3]'
SELECT json_parse('{"a": 1, "b": 2}'); -- JSON '{"a":1,"b":2}'
 
SELECT json_parse('["hey", 42, {"xyz": 100, "abc": false}]');

This last example has an array that contains a string, a number, and then a map. The map has two keys, xyz and abc; notice that xyz in the input comes before abc, so the keys are not in sorted order.

After it’s parsed with json_parse, the keys in the map are ordered. This is why two strings containing the same map with differently ordered keys produce the same JSON value and compare as equal – once parsed, they have the same contents.

In Presto, all the JSON functions take the JSON data type. And if you pass a string into the JSON function, it will be implicitly parsed into JSON.

Another question that comes up is: what is the difference between a JSON null and a SQL NULL?

This can be confusing because JSON has a null and so does SQL, but they are two different things. A SQL NULL is a special value that behaves very differently from nulls in other programming languages: most operations on a SQL NULL return NULL. A JSON null, on the other hand, is just another value, like a number, a string, or an array. This has implications when writing Boolean expressions.
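
Here’s a small sketch illustrating the difference, along with a couple of the JSON functions mentioned above (the values are purely illustrative):

SELECT 1 + CAST(NULL AS integer);                         -- NULL: operations on a SQL NULL yield NULL
SELECT json_parse('null');                                -- JSON 'null': just another JSON value
SELECT json_extract_scalar('{"a": 1, "b": "x"}', '$.b');  -- x
SELECT json_array_length(json_parse('[1, 2, 3]'));        -- 3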

We have a lot more resources on Presto if you’d like to learn more, check out our Getting Started with Presto page.

Is there latency overhead for Presto queries if everything fits into memory and doesn’t need to be distributed?

Presto is both in-memory and distributed, so each worker has memory and uses it. However, Presto is not an in-memory database: Presto is the query engine, and it reads from the storage underneath. This is where Presto’s caching capabilities become important.

If all the data required to satisfy a given query arrives from the connected data sources and fits within the memory of the Presto cluster, everything can be processed in memory. Beyond that, Presto’s MPP, in-memory pipelining architecture will accelerate workloads further in two ways:

1) Presto will ensure data is divided across all the workers to bring maximum processing power and memory capacity to bear.

2) Presto’s generated query plan will ensure execution steps are distributed amongst all the workers such that processing takes place in parallel on the data across all workers simultaneously, as efficiently as possible. Furthermore, with its caching capabilities, Presto can accelerate query execution even further for specific workload patterns, further reducing latency.
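
If you want to see how Presto splits a query into distributed stages across the workers, EXPLAIN can show you the plan. A minimal sketch, assuming the built-in TPC-H catalog is configured as tpch:

-- Show the distributed (per-stage) plan rather than executing the query
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderpriority, count(*)
FROM tpch.tiny.orders
GROUP BY orderpriority;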

If you want to get started with Presto, check out our docs to learn more about the Ahana Cloud managed service for Presto.

Is the Hive metastore a hard dependency of Presto, or could Presto be configured to use something else like Postgres?

With Presto, there’s no hard dependency of having to use the Hive metastore – it’s catalog-agnostic. However, there are significant advantages like better performance when you use the Hive metastore.

A managed service like Ahana Cloud for Presto provides a managed version of a Hive metastore catalog, so you don’t have to deal with the complexity of managing your own metastore. Ahana also supports AWS Glue as a data catalog service.

You don’t necessarily have to use the metastore if you are connecting via the provided connectors such as MySQL, Postgres, Redshift, Elasticsearch, etc. But if you intend to query data on S3, for instance, you will need a metastore (i.e. Hive or Glue) to define external tables and their formats.
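
For example, here’s a hedged sketch of registering S3 data as an external table through the Hive connector and a Hive/Glue metastore – the catalog, bucket, and column names are illustrative:

-- Illustrative names: "hive" catalog, "default" schema, a made-up S3 bucket
CREATE TABLE hive.default.web_logs (
  event_time timestamp,
  user_id varchar,
  url varchar
)
WITH (
  external_location = 's3a://my-example-bucket/web_logs/',
  format = 'PARQUET'
);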

Presto has many connectors including Postgres, which is a supported data source.

What are the differences between Presto and Apache Drill?

Drill is an open source SQL query engine inspired by the 2010 Google paper “Dremel: Interactive Analysis of Web-Scale Datasets”. Development of Apache Drill began in 2012.

Performance & Concurrency
Drill has never enjoyed wide adoption partially because it was tied to one Hadoop distribution (MapR) but mainly because of inherent performance and concurrency limitations. There are companies that built products based on Drill who report these performance and concurrency issues, and many have migrated away from Drill as a result. Presto’s popularity and adoption, on the other hand, has exploded as numerous companies from SME to web-scale have deployed Presto and contribute to the project. These include Facebook, Uber, Alibaba, Twitter, Netflix, AirBnB and LinkedIn. Users value Presto’s vendor-neutrality and rate of innovation.

Metastores
Presto connects to external metastores (AWS Glue, Hive Metastore Catalog); many users deploy Presto + AWS Glue/Hive for their data lake analytics. In addition, schema for relational data sources can be obtained directly from each connector. On the other hand, Drill performs its own schema discovery and does not need a metastore such as Hive (but can use Hive if needed), which is a benefit.

Community
Compared to Presto, the Apache Drill community has dwindled somewhat especially when it comes to adoption and contributions. We talk to many users who are looking to move from Apache Drill to Presto. If we look at DB engines – https://db-engines.com/en/ranking/relational+dbms – we see that Presto has continued on a positive upward trend and ranks #27, as opposed to Drill which is at #49 in a downward trend.

Overall, Drill is in decline. This was accelerated when HPE (Hewlett Packard Enterprise, who acquired MapR) announced they will no longer support or contribute to Drill (a drawback of having a query engine tied to an ecosystem such as Hadoop). Presto’s popularity continues to increase, as illustrated by 1) the number of commits to the project (In the past month 46 authors have pushed 79 commits to master and 88 commits to all branches, and on master 568 files have changed and there have been 10,930 additions – as of Feb 26 2021), and 2) Presto’s continued rise on DB-engines.

Drill is a top-level Apache Foundation project but does not have a strong and active community behind it. Presto is backed by a strong community and is overseen by the Presto Foundation which is part of the Linux Foundation with 8 premier organizations driving it including Facebook, Uber, Twitter, Intel and Ahana.

Want to learn more about PrestoDB? Check out our free tutorials to help you get started.

Why am I getting zero records when I use AWS Athena to query a CSV file?

There’s a common error many AWS Athena users see when they query CSV files – they will get zero records back. This tends to happen when they run the AWS Glue crawler and create a new table on the CSV file. There are several reasons why the query might return no records.

  1. The crawler is pointing to a file instead of an AWS S3 bucket
  2. The LOCATION path is wrong or there’s a double slash in it
  3. The partitions haven’t loaded into the Glue Data Catalog or the internal Athena data catalog
  4. In general, CSV crawlers can be sensitive to different issues (i.e. embedded newlines, partially quoted files, blanks in integer fields), so make sure everything is accurate

On the first point, if you have selected a file instead of your S3 bucket, the crawler will succeed but you won’t be able to query the contents which is why you’ll see the ‘Zero Records Returned’ error message. If there are other files that you don’t want crawled, you’ll need to create a new folder and move your CSV to that new folder, and then update the include path accordingly (and you’ll need to re-crawl it).

In order to get your S3 data lake to work, you’ll need to make sure each batch of same-schema files has its own top level directory.
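
As an illustration, here’s a hypothetical Athena DDL statement where LOCATION points at a folder of same-schema CSV files rather than at an individual file – the bucket, folder, and column names are made up:

-- LOCATION must be an S3 folder, not a single CSV file
CREATE EXTERNAL TABLE sales_csv (
  order_id string,
  amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/sales_csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');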

If you’re running into issues like this with AWS Athena and want to offload the management while still getting the power of Presto, check out Ahana Cloud. It’s a fully managed service for AWS that removes these types of complexities and makes it really easy to run Presto in the cloud.

Does Presto work natively with GraphQL?

Some users may have a primary data store that is GraphQL-based (AWS AppSync) and want to leverage Presto. For context, GraphQL sits in the application tier as part of the service mesh infrastructure. It is an abstraction API for mostly operational sources.

Typically for analytics, users move their data into an analytical stack, because the properties and queries of an operational stack and analytical stack are quite different.

Users in this situation typically move the data to S3 to run data lake analytics. Presto is the de facto engine for data lake analytics, and in addition it also allows for running queries across S3 and operational systems like RDS.

Running analytics on GraphQL / AppSync is not recommended because:

  1. GraphQL isn’t really a query engine with an optimizer, etc.
  2. Analytical queries are fairly intensive from a compute perspective and need the right analytical engine to process and execute them. GraphQL is mostly a pass-through, so running analytics through it will affect your operational systems.

Want to learn more about Presto? Download our free whitepaper: What is Presto?

What is a Presto Connection and how does it work?

Connections to and from Presto fall into two main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. The connectors are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive which gives query access to HDFS and S3 data stored in a variety of formats.

2) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above information helps you with creating Presto Connections.

Why does a single AWS Athena query get stuck in QUEUED state before being executed?

One of the drawbacks of AWS Athena is the fact that as a user, you don’t have control over query performance and predictability. One specific issue that comes up quite a bit for Athena users is single queries getting stuck in a QUEUED state before being executed. This happens because of Athena’s serverless nature in combination with its multi-tenant service – you are essentially competing for resources when you run a query.

Many users will get the message “Query waiting in queue” somewhat randomly. For this reason, it’s recommended not to use Athena for user-facing requests because of its unpredictability, and many users have decided they can’t rely on Athena at all because of this issue. Because you can’t predict demand or manage multi-tenancy, it’s not ideal for ad-hoc analytics (which is a core use case for PrestoDB).

For users who still want to leverage PrestoDB for ad-hoc analytics and data lake analytics, Ahana Cloud is a solution that allows you to run Presto transparently as a managed service. You won’t run into the issues around unpredictability and queries getting stuck. There’s a 14 day free trial if you’re interested in checking it out.

Presto Data Share

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables data sharing in two ways. 

1) With its broad list of data connectors, Presto provides a simple way to exploit data virtualization with federated SQL queries accessing data across multiple data sources.  Data in RDBMSs, NoSQL databases, legacy data warehouses, files in object storage can all be accessed,  and combined in a single SQL query if required.  Presto allows querying data where it lives, like in Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization by a variety of users and applications.  A complete list of Presto’s connectors can be found at  https://prestodb.io/docs/current/connector.html 

2) Sharing access to data by connecting your Presto cluster to other systems, tools, and applications, which is achieved using the Presto API or SQL via client libraries, ODBC, or JDBC interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query 

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries 

When it comes to using Presto to share data, we hope the above information is useful.

Presto Performance

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Performance is particularly important for ad-hoc analytical queries so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design.  In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates (see the example after this list).
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 
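
As a sketch of the table-statistics point above (the table name is illustrative):

-- Collect statistics, then inspect what the planner will use
ANALYZE hive.default.orders;
SHOW STATS FOR hive.default.orders;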

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. There are many different types of storage that can be queried by Presto, some faster than others, so choosing the fastest data source will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both; switching from a data source that is busy serving many users, and therefore under high contention, to an alternative such as a read replica of a database; creating indexes or accessing a pre-aggregated version of the data; or moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s supported file formats that feature performance optimizations, like ORC or Parquet, and enabling compression.

Presto Performance Tips

The following recommendations can help you achieve maximum performance from your Presto clusters:  

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto. A useful rule of thumb is: in each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads.
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding the partitioned_by clause to the CREATE TABLE (a combined example follows this list).
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.  Here’s a summary of the differences:

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible. Use the SQL ANALYZE <tablename> command to do this. Repeat the ANALYZE command for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore. For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'full'); Do this right after you create the table, and repeat it when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type='AUTOMATIC';
    • SET session join_reordering_strategy='AUTOMATIC';

(both SETs are needed – and are persisted until session has ended/log out).  

  • You should enable Dynamic Filtering when 1 or more joins are in-play, especially if there’s a smaller dimension table being used to probe a larger fact table for example. Dynamic Filtering is pushed down to ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve performance of Hash JOINs. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O.  Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 instances:

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 
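
Pulling several of the tips above together, here’s a minimal sketch – it assumes a Hive-connector catalog named hive, and the schema, table, and column names are illustrative:

-- Create a partitioned, bucketed ORC copy of an existing table
-- (partition columns must come last in the column list)
CREATE TABLE hive.default.orders_part
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['order_date'],
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 32
)
AS SELECT customer_id, total_price, order_date FROM hive.default.orders;

-- Collect table statistics for the cost-based optimizer
ANALYZE hive.default.orders_part;

-- Sync partition metadata (run with the hive catalog as your session catalog,
-- or prefix the call with the catalog name)
CALL system.sync_partition_metadata('default', 'orders_part', 'FULL');

-- Session settings for CBO join handling and dynamic filtering
SET SESSION join_distribution_type='AUTOMATIC';
SET SESSION join_reordering_strategy='AUTOMATIC';
SET SESSION enable_dynamic_filtering=true;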


Finally, to assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service. 

We hope these Presto performance recommendations help you get the most out of Presto. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto REST API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

 http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID/password. Stats for the simple test query – select now() – can be seen:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]

Presto Parquet

Parquet is a columnar storage format for Hadoop, supported by Presto.  Storing data as columns as opposed to rows enables Presto to fulfill queries more efficiently. By discarding unwanted data in rows, and preventing unnecessary scans, columnar storage saves disk space and improves query performance for larger data sets.

In Parquet, data is first horizontally partitioned into groups of rows, then within each group, data is vertically partitioned into columns. Data for a particular column is stored together via compression and encoding to save space and improve performance. Each Parquet file has a footer that stores codecs, encoding information, as well as column-level statistics, e.g., the minimum and maximum number of column values. 
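
To take advantage of Parquet from Presto you typically write the data via the Hive connector. A minimal sketch, with illustrative catalog and table names:

-- Write an existing table out in Parquet format, then query it
CREATE TABLE hive.default.events_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM hive.default.events_raw;

SELECT count(*) FROM hive.default.events_parquet;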

There have been several improvements made to Presto’s Parquet reader by the community, most notably by Uber, to enhance performance with features such as pushdown.  

For developers, a useful tool-set for working with Parquet files is parquet-tools, which allows you to see the schema contained in a file, view data, examine data types, etc. Parquet-tools can be installed with pip (https://pypi.org/project/parquet-tools/), installed on a Mac with brew install parquet-tools, or cloned and built from https://github.com/apache/parquet-mr

Presto’s Parquet support is a popular choice for data storage.

Presto Speed

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Efficiency and speed are important for query performance so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design.  In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. There are many different types of storage that can be queried by Presto, some faster than others, so choosing the fastest data source will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both; switching from a data source that is busy serving many users, and therefore under high contention, to an alternative such as a read replica of a database; creating indexes or accessing a pre-aggregated version of the data; or moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s supported file formats that feature performance optimizations, like ORC or Parquet, and enabling compression.


To assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 
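
For instance, assuming a catalog named tpch is configured, you can query generated benchmark data directly:

SHOW SCHEMAS FROM tpch;
SELECT count(*) FROM tpch.sf1.lineitem;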

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service

Presto Tools

Several tools and libraries are available; some are bundled with core Presto and others are downloadable extras. See the documentation Resources page at https://prestodb.io/resources.html for details.

Core Presto Tools:

  • Presto Console – Included with Presto, this browser-based UI allows administrators to monitor many aspects of their Presto deployment in real time, such as query workload, memory usage, execution plans etc. 

Community Contributed Tools 

Several community tools are listed at https://prestodb.io/resources.html including:

Other tools & Libraries 

Presto Documentation

Presto’s documentation is maintained by the Presto Foundation and the current version is available here: https://prestodb.io/docs/current/ 

Releases can be found at: https://github.com/prestodb and the Release Notes are at: https://prestodb.io/docs/current/release.html 

Currently the documentation site has no search function. Use Google to search the docs instead – for example if you want to find the Presto documentation page for window functions just google site:prestodb.io window functions and click the first match.

Some of the most popular and useful documentation pages are listed below for convenience: 

In addition the documentation for Ahana Cloud for Presto – a fully integrated, cloud-native managed service built for AWS – can be found at https://docs.ahana.cloud/docs

That’s your summary of Presto documentation. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Query Example

Since Presto is an ANSI SQL query engine, its SQL will be very familiar to practically anyone who has used a database, despite the fact Presto is not technically a database since it stores no data itself.

Starting with the SELECT statement, the full range of keywords that you would expect to use are supported, including DISTINCT, FROM, WHERE, GROUP BY, HAVING, UNION | INTERSECT | EXCEPT, ORDER BY, LIMIT and so on. More details can be found on the Presto SELECT doc page: https://prestodb.io/docs/current/sql/select.html
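
Here’s a small example exercising several of those clauses, assuming the built-in TPC-H catalog is configured as tpch:

SELECT orderpriority, count(*) AS num_orders
FROM tpch.tiny.orders
WHERE orderdate >= DATE '1995-01-01'
GROUP BY orderpriority
HAVING count(*) > 100
ORDER BY num_orders DESC
LIMIT 5;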

Like most other database products, CTEs are also supported – Common Table Expressions allow you to define a temporary named result set that is available in the execution scope of a statement such as SELECT or INSERT.  Here’s the same query written without and then with a CTE:

--Without CTE
SELECT a, b
FROM (
  SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;
 
--WITH CTE
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Here’s a CTE example query:

presto> WITH my_time_data (the_time_now) as ( select now() ) 
SELECT the_time_now 
FROM my_time_data;
 
             the_time_now              
---------------------------------------
 2021-01-18 12:23:12.167 Europe/London 
(1 row)

The above Presto query example and documentation link should get you up and running writing Presto queries. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Best Practices

We’ve put together a list of best practices for Presto, the open source distributed SQL engine, based on our experience. We hope you find these helpful.

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto. A useful rule of thumb is: in each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads.
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding  the partitioned_by clause to the CREATE TABLE.
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.  Here’s a summary of the differences:

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible. Use the SQL ANALYZE <tablename> command to do this. Repeat the ANALYZE command for all tables involved in queries on a regular basis, typically when data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore. For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'full'); Do this right after you create the table, and repeat it when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET session join_distribution_type='AUTOMATIC';
    • SET session join_reordering_strategy='AUTOMATIC';

(both SETs are needed – and are persisted until session has ended/log out).  

  • You should enable Dynamic Filtering when 1 or more joins are in-play, especially if there’s a smaller dimension table being used to probe a larger fact table for example. Dynamic Filtering is pushed down to ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve performance of Hash JOINs. Enable this with:
    • SET session enable_dynamic_filtering=TRUE;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O.  Picking the right kind of instance for worker nodes is important.  Most analytical workloads are IO intensive so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher Network IO instances for the workers – for example on AWS you can do this by looking at each instance type’s “network performance” rating – here are the ratings for the m4 instances:

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 

We hope you find these Presto best practices useful. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

How Does Presto Join Data?

Because Presto is a distributed system composed of a coordinator and workers, each worker can connect to one or more data sources through corresponding connectors.

The coordinator receives the query from the client and optimises and plans the query execution, breaking it down into constituent parts, to produce the most efficient execution steps. The execution steps are sent to the workers which then use the connectors to submit tasks to the data sources. The tasks could be file reads, or SQL statements, and are optimised for the data source and the way in which the source organises its data, taking into account partitioning and indexing for example.

The data sources supported by Presto are numerous and can be an RDBMS, a noSQL DB, or Parquet/ORC files in an object store like S3 for example. The data sources execute the low level queries by scanning, performing filtering, partition pruning etc. and return the results back to the Presto workers. The join operation (and other processing) is performed by the workers on the received data, consolidated, and the joined result set is returned back to the coordinator.

You will notice Presto uses a “push model”, which is different, for example, to Hive’s “pull model”. Presto pushes execution steps to the data sources, so some processing happens at the source and some happens in Presto’s workers. The workers also communicate between each other, and the processing takes place in memory, which makes it very efficient and well suited to interactive queries. Hive, on the other hand, will read/pull a block of a data file, execute tasks, then wait for the next block, using the MapReduce framework. Hive’s approach is not suitable for interactive queries since it reads raw data from disk and stores intermediate data to disk, all using MapReduce, which is better suited to long-running batch processing.

How Presto Joins Relational and Non-Relational Sources

The next diagram shows some of Presto’s core Coordinator components and the kinds of tasks Presto’s workers handle. In this simplistic example there are two data sources being accessed; one worker is scanning a Hive data source, the other worker is scanning a MongoDB data source. Remember Presto does not use Hive’s MapReduce query engine or HQL – the diagram’s “hive” worker means it is using the “hive connector” and its metastore information; the raw source data is external to Presto, maybe in HDFS in Parquet or ORC format, for example. The worker dealing with the MongoDB data is described as being on the “probe side” (in this example), whereby the MongoDB data is read, processed, normalized into columnar stores in Presto and then shuffled (or exchanged) across the cluster to the “builder” worker (the worker dealing with the Hive data here) for the actual join to take place.

This is a simplistic example since in reality Presto is more sophisticated – the join operation could be running in parallel across multiple workers, with a final stage running on one node (since it cannot be parallelized). This final stage is represented by the third worker at the top of the diagram, labeled “Output”. This output node’s task is to stream the result set back to the coordinator, and from there back to the client.

Joins – Summary

It’s easy to see how Presto is the “Polyglot Data Access Layer”: it doesn’t matter where your data lives, any query can access any data, in-place, without ETL, data shipping or duplication. It is true federation. Even when blending very different sources of data, like JSON data in Elasticsearch or MongoDB with tables in a MySQL RDBMS, Presto takes care of the flattening and processing to provide a complete, unified view of your data corpus.

If you want to try out Presto, take a look at Ahana Cloud. It provides a managed service for Presto in AWS.

When I run a query with AWS Athena, I get the error message ‘query exhausted resources on this scale factor’. Why?

AWS Athena is well documented as having performance issues, both in terms of unpredictability and speed. Many users have pointed out that even relatively lightweight queries on Athena can fail. One part of the issue may be how many columns the user has in the GROUP BY clause – even a small number of columns (fewer than five) can run into this issue of not having enough resources to complete. Other times it may be how much data is being parsed – again, even small amounts of data (less than 200MB) can hit the same error.

Presto stores GROUP BY columns in memory while it works to match rows with the same grouping key. The more columns in the GROUP BY clause, the fewer rows get collapsed by the aggregation. To address this problem, reduce the number of columns in the GROUP BY clause and retry the query.
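As a hedged illustration – the table and column names below are hypothetical – a wide GROUP BY can often be trimmed down to just the columns the report actually needs:

-- A wide GROUP BY keeps many distinct grouping keys in memory
-- and is the kind of query that can exhaust resources on Athena:
SELECT col_a, col_b, col_c, col_d, col_e, count(*) AS cnt
FROM my_table
GROUP BY col_a, col_b, col_c, col_d, col_e;

-- Grouping on fewer columns collapses far more rows per group
-- and uses much less memory:
SELECT col_a, col_b, count(*) AS cnt
FROM my_table
GROUP BY col_a, col_b;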

And at still other times, the issue may not be how long the query takes but whether the query runs at all. Users that experience “internal errors” on queries one hour will re-run the same queries later and see them succeed.

Ultimately, AWS Athena is not predictable when it comes to query performance. That’s where Ahana Cloud, a managed service for Presto, can help. Ahana’s managed service for PrestoDB can help with some of the trade offs associated with a serverless service.

Some of the reasons you might want to try a managed service if you’re running into performance issues with AWS Athena:

  • You get full control of your deployment, including the number of PrestoDB nodes and the node instance types, for optimum price/performance.
  • Consistent performance because you have full control of the deployment
  • Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, increased resilience, speed, and ease of use.
  • No limits on queries

Plus you can use your existing metastore, so you don’t need to modify your existing architecture. Check out the Ahana Cloud free trial and get started today.

I use RDS Postgres databases and need some complex queries done which tend to slow down my databases for everyone else on the system. What do I need to consider as I add a data lake for the analytics?

Background

Many medium-sized companies start out using one of the six flavors of Amazon RDS: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. As they grow they can end up with numerous RDS instances and numerous databases for each customer, whether internal or external. When one customer runs a large analytic query, that workload can cause problems for the RDS cluster, perhaps causing it to drop other workloads, fail, or slow down for everyone else. As the need to process huge amounts of data increases, so does the need to take your analytics to the next level.

In addition to your operational databases, the idea is to have a much more open analytics stack in which you can run different kinds of processing on the same data. A modern analytics stack gives your organization far more insight without impacting the operational side, and doing it with open data formats is another key consideration.

Considerations

There are a couple of options for evolving your analytics stack. One would be to use a cloud data warehouse like Amazon Redshift or Snowflake. Another would be to use open formats in a data lake with a modern SQL query engine. In the first case you get some advantages in raw performance on your data, but it comes at a cost as well as a degree of lock-in, since you cannot easily get at data held in proprietary formats. Considering the data lake with a query engine option, we believe that Presto is one of the best choices because of its performance, scalability, and flexibility to connect to S3-based data lakes and federate other data sources as well.

So our recommendation would be to run Presto on top of data stored in an open Parquet or ORC format in S3. Doing it this way, you can put other engines on top of the data as needed, so you’re not going to face a lot of rework in the future should you decide to change something.

From OLTP to Data Lake Analytics

The high-level concept is to do an initial one-time bulk migration of the data in your OLTP databases to get a copy into S3. After that, as your operational databases continue to generate or change data, you’ll need to establish a pipeline, a stream, or a Change Data Capture (CDC) process to get those changes into S3. Note that you will rarely want data flowing back from S3 into your relational databases.

While there are different ways to pipe data into S3, one AWS-recommended approach is to use the AWS Database Migration Service, a.k.a. DMS (much in the same way you may have used it when you migrated from on-premises). With AWS Database Migration Service, you can do the first one-time bulk load and then continuously replicate your data with high availability, streaming it to Amazon S3. AWS DMS runs in the background and handles all of the data changes for you. You can pick the instance size and the schedule you’d like it to run on; for example, you may want hourly or daily partitions. That will depend on how fast your data is changing and on your requirements for analytics.

Next you’ll want to install Presto on top, and for that you can build a Presto cluster yourself, or simply use Ahana Cloud for Presto to create pre-configured clusters in about 30 minutes.

It’s also worth noting that after you’ve ingested the data into S3 in what you think is the most optimized format or folder structure, you may find out that you need it structured differently. In that case, not to worry: you can use Presto itself to do data lake transformations, using SQL to run a CTAS, Create Table As Select:

The CREATE TABLE AS SELECT (CTAS) statement is one of the most important SQL features available. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest and fastest way to create and insert data into a table with a single command.
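Here is a minimal sketch of that kind of transformation – the catalog, schema, table and column names are illustrative only:

-- Rewrite raw CSV data into partitioned Parquet with a single CTAS statement.
-- (ahana_hive, the schemas and the columns are hypothetical.)
CREATE TABLE ahana_hive.curated.orders_parquet
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['order_date']
)
AS
SELECT order_id, customer_id, total_price, order_date
FROM ahana_hive.raw.orders_csv;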

Summary

By adding a modern analytics stack to your operational databases, you evolve your analytics capabilities and deliver more insights for better decisions. We suggest moving to an open data lake analytics reference architecture with Presto. This allows analysis across all of your data, giving you a look at broader trends across databases and other data sources.

Ahana Cofounders Make Data Predictions for 2021

Open Analytics, Open Source, Multi-Cloud and Federated, Disaggregated Stack Rise to the Top 

San Mateo, Calif. – January 6, 2021 – Ahana’s Cofounder and Chief Product Officer, Dipti Borkar, and Cofounder and Chief Technology Officer, Dave Simmen, predict major developments in cloud, data analytics, databases and data warehousing in 2021.

As the shift to the cloud and multi-cloud environments has become even greater during the past year hastened by the challenges of the COVID-19 pandemic, new challenges have arisen when it comes to managing data and workloads. Companies want to keep their data secure in their own accounts and environments but still leverage the cloud for analytics. They want quick analytics on their data lakes. They also want to take advantage of containers for their applications across different cloud environments. 

Dipti Borkar, Co-founder and Chief Product Officer, outlines the major trends she sees on the horizon in 2021:

  • Open Source for Analytics & AI – Next year will see a rise in usage of analytic engines like Presto and Apache Spark for AI applications because of their open nature – open source license, open format, open interfaces, and open cloud.
  • Open Source for Open Analytics – More technology companies will adopt an open source approach for analytics compared to the proprietary formats and technology lock-in that came with traditional data warehousing. This open analytics stack uses open source Presto as the core engine; open formats such as JSON, Apache ORC, Apache Parquet and others; open interfaces such as standard JDBC / ODBC drivers to connect to any reporting / dashboarding / notebook tool and ANSI SQL compatibility; and is open cloud.
  • Containerizing Workloads for Multi-Cloud Environments – As the multi-cloud approach has gained traction over the past year, next year more companies will run container workloads in their multi-cloud environments. To do that, they’ll need their compute processing to be multi-cloud ready and containerized out of the box, so choosing a compute framework will become even more critical for these workloads. Engines like Presto, which are multi-cloud ready and container friendly, will become the core engine of these multi-cloud containerized workloads.
  • Less Complexity, More Kubernetes-ity for SaaS – Containers provide scalability, portability, extensibility and availability advantages, but managing them is not seamless and, in fact, is often a headache. Kubernetes takes that pain away for building, delivering and scaling containerized apps. 2021 will bring more managed SaaS apps running on K8s, and those that are able to abstract the complexities of their platforms from users will emerge as the winners.
  • The New “In-VPC” Deployment Model – As cloud adoption has become mainstream, companies are creating and storing the majority of their data in the cloud, especially in cost-efficient Amazon S3-based data lakes. To address data security concerns, these companies want to remain in their own Virtual Private Cloud (VPC). As a result, 2021 will bring in a new cloud-native architecture model for data-focused managed services, which I’m calling the “In-VPC” deployment model. It separates the control plane from the compute and data planes for better security and cleaner management.

Dave Simmen, Co-founder and Chief Technology Officer, outlines the major trends he sees on the horizon in 2021:

  • The Next Evolution of Analytics Brings a Federated, Disaggregated Stack – A federated, disaggregated stack that addresses the new realities of data is displacing the traditional data warehouse with its tightly coupled database. The next evolution of analytics foresees that a single database can no longer be the solution to support a wide range of analytics as data will be stored in both data lakes and a range of other databases. SQL analytics will be needed for querying both the data lake and other databases. We’ll see this new disaggregated stack become the dominant standard for analytics with SQL-based technologies like the Presto SQL query engine at the core, surrounded by notebooks like Jupyter and Zeppelin and BI tools like Tableau, PowerBI, and Looker.
  • SQL Is the New…SQL – As companies shift their data infrastructure to a federated (one engine queries different sources), disaggregated (compute is separate from storage, which is separate from the data lake) stack, we’ll see traditional data warehousing and tightly coupled database architectures relegated to legacy workloads. But one thing will remain the same when it comes to this shift – SQL will continue to be the lingua franca for analytics. Data analysts, data engineers, data scientists and product managers along with their database admins will use SQL for analytics.

Tweet this: @AhanaIO announces 2021 #Data Predictions #cloud #opensource #analytics https://bit.ly/39j6HHl

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What is DB Presto Online?

If you are looking for online resources for Presto like the docs, then start here https://prestodb.io/ 

If you are looking for information on how to tell if your Presto cluster is online or offline then read on.

Presto Mac/Brew Install

If you installed Presto on your Mac using brew then the service is called presto-server and you manage it like this:

$ presto-server run

To tell if your presto server is online:

$ presto-server status
Not Running

Presto manual Install 

If you are running a regular deployment of open source Presto you start/stop/check status with:

$ bin/launcher stop
$ bin/launcher start
$ bin/launcher status
Running as 61824

To be sure your Presto server is online, point your browser at its host and port (e.g. localhost:8080) and you should see the Presto Console’s “Cluster Overview” UI.

Advanced SQL Queries with Presto

Advanced SQL features and functions are used by analysts when, for example, complex calculations are needed, when many tables (perhaps from multiple sources) need to be joined, when dealing with nested or repeated data, time-series data, or complex data types like maps, arrays, structs and JSON – or when a combination of all of these comes into play.

Presto’s ANSI SQL engine supports numerous advanced functions – including aggregate and window functions, JSON, map and array functions, date/time functions, and regular-expression functions – and each category is covered in the PrestoDB documentation.

Running advanced SQL queries can benefit greatly from Presto’s distributed, in-memory processing architecture and cost-based optimizer. 
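For a flavor of what this looks like in practice, here is a hedged sketch – the orders table and its columns are hypothetical – combining a window function with a JSON function:

-- Rank each customer's orders by value and pull a field out of a JSON column.
SELECT customer_id,
       order_id,
       total_price,
       rank() OVER (PARTITION BY customer_id ORDER BY total_price DESC) AS price_rank,
       json_extract_scalar(metadata, '$.channel') AS order_channel
FROM orders
WHERE order_date >= date '2020-01-01';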

Presto Platform Overview

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Some of the world’s most innovative and data-driven companies like Facebook, Walmart and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte of data per day.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto allows querying data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. Presto is an in-memory distributed, parallel system. 

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics with an expensive commercial solution or using a slow “free” solution that requires excessive hardware.

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: Connectors link Presto to a data source such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database. 
  • A parser, planner, scheduler, cost-based query optimizer, and execution engine.
  • Drivers for connecting tools, including JDBC. The Presto-cli tool. The Presto Console. 

In terms of organization the community owned and driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. Presto software is released under the Apache License 2.0.

Ahana offers a managed service for Presto in the cloud. You can get started for free today.

0 to Presto in 30 minutes with AWS and Ahana Cloud
Video & Slides


How To Stop Presto

If you are using the presto-admin tool this is how to stop Presto safely:

$ presto-admin server stop

In addition these commands are also useful:

presto-admin server start
presto-admin server restart
presto-admin server status

If you are using Presto on EMR, you can restart Presto with:

sudo stop presto
sudo start presto

(If you have changed any configuration params you should do this on every node where you made a change).

The above “How To Stop Presto” information is correct for PrestoDB. But other “hard forks” of Presto may use different methods to stop and start. 

Presto New Releases

Where is the latest release of PrestoDB? And where can I find the release notes? Where is the documentation? These are common questions with easy answers. Presto’s web site https://prestodb.io/ and GitHub have all the information you need:

Releases: https://github.com/prestodb for the Presto and Presto-admin repositories.

Release Notes: https://prestodb.io/docs/current/release.html 

Documentation: https://prestodb.io/docs/current/ 

The above list has all the main resources you need for working with Presto’s New Releases. 

How Much Memory Should I Give A Presto Worker Node?

Presto is an in-memory query engine and so naturally memory configuration and management is important.  

JVM Memory

Presto’s JVM memory config nearly always needs to be configured – you shouldn’t be running Presto with its default setting of 16GB of memory per worker/coordinator. 

The Xmx flag specifies the maximum memory allocation pool for a Java virtual machine. Change the  -Xmx16G in the jvm.config file to a number based on your cluster’s capacity, and number of nodes.  See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html on how to do this.  

Rule of Thumb

It is recommended you set aside 15-20% of total physical memory for the OS. So, for example, if you are using EC2 “r5.xlarge” instances, which have 32GB of memory: 32GB - 20% = 25.6GB, so you would use -Xmx25G in the jvm.config file for the coordinator and workers (or -Xmx27G if you want to go with 15% for the OS).

This is assuming there are no other services running on the server/instance, so maximum memory can be given to Presto. 

Presto Memory 

As with the JVM settings above, there are two memory-related settings you should check before starting Presto. For most workloads Presto’s other memory settings work perfectly well at their defaults, but there are configurable parameters controlling memory allocation that can be useful for specific workloads. The practical guidelines below will 1) help you decide whether you need to change your Presto memory configuration, and 2) identify which parameters to change.

Workload Considerations

You may want to change Presto’s memory configuration to optimise ETL workloads versus analytical workloads, or for high query concurrency versus single-query scenarios.  There’s a great in-depth blog on Presto’s memory management written by one of Presto’s contributors at  https://prestodb.io/blog/2019/08/19/memory-tracking which will guide you in making more detailed tweaks. 

Configuration Files & Parameters 

When first deploying Presto there are two memory settings that need checking. Locate the config.properties files for both the coordinator and worker. There are two important parameters here: query.max-memory-per-node and query.max-memory.  Again see https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for rules-of-thumb and how to configure these parameters based on the available memory. 
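These limits can also be tightened for an individual session, which is a handy way to experiment before editing config.properties. A minimal sketch (the values are arbitrary; session settings generally only tighten, not exceed, the limits in config.properties):

-- Lower the query memory limits for this session only.
SET SESSION query_max_memory = '40GB';
SET SESSION query_max_memory_per_node = '8GB';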

The above guidelines and links should help you when considering how much memory to give a worker node. To avoid memory configuration work altogether, we recommend Ahana Cloud for Presto – a fully managed service for Presto that needs zero configuration.

Ahana Announces General Availability of Managed Service for Presto on AWS; Delivers Combined Solution with Intel to Drive Adoption of Open Data Lakes Analytics

San Mateo, Calif. – December 9, 2020 Ahana, the self-service analytics company for Presto, announced today the General Availability of Ahana Cloud for Presto, the first cloud-native managed service focused on Presto on Amazon Web Services (AWS). Additionally, Ahana announced a go-to-market solution in collaboration with Intel via its participation in the Intel Disruptor Program to offer an Open Data Lake Analytics Accelerator Package for Ahana Cloud users that leverages Intel Optane on the cloud with AWS. 

Ahana Cloud for Presto is the only easy-to-use, cloud-native managed service for Presto and is deployed within the user’s AWS account, giving customers complete control and visibility of clusters and their data. In addition to the platform’s use of Amazon Elastic Kubernetes Services (Amazon EKS) and a Pay-As-You-Go (PAYGO) pricing model in AWS Marketplace, the new release includes enhanced manageability, security and integrations via AWS Marketplace.

 Ahana Cloud for Presto includes: 

  • Easy-to-use Ahana SaaS Console for creation, deployment and management of multiple Presto clusters within a user’s AWS account bringing the compute to user’s data 
  • Support for Amazon Simple Storage Service (Amazon S3), Amazon Relational Database (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL and Amazon Elasticsearch 
  • Click-button integration for user-managed Hive Metastores and Amazon Glue
  • Built-in hosted Hive Metastore that manages metadata for data stored in Amazon S3 data lakes 
  • Pre-integrated and directly queryable Presto query log and integrations with Amazon CloudWatch 
  • Cloud-native, highly scalable and available containerized environment deployed on Amazon EKS

“With Ahana Cloud being generally available, the power of Presto is now accessible to any data team of any size and skill level. By abstracting away the complexities of deployment, configuration and management, platform teams can now deploy ‘self-service’ Presto for open data lake analytics as well as analytics on a range of other data sources,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “Users are looking for analytics without being locked-in to proprietary data warehouses. This offering brings a SaaS open source analytics option to users with Presto at its core, using open formats and open interfaces.”

“As Ahana Cloud users, we saw from day one the value the platform brings to our engineering team,” said Kian Sheik, Data Engineer, ReferralExchange. “Within about an hour we were up and running Presto queries on our data, without having to worry about Presto under the covers. With out-of-the-box integrations with a Hive data catalog and no configurations needed, Ahana Cloud takes care of the intricacies of the system, allowing our team to focus on deriving actionable insights on our data.”

Ahana also announced its participation in the Intel Disruptor Program to drive the adoption of Open Data Lake Analytics. Together, Ahana and Intel will offer an Open Data Lake Analytics Accelerator Package, available for Ahana Cloud users that leverage Intel Optane on AWS. It includes special incentives and PAYGO pricing. An Open Data Lake Analytics approach is a technology stack that includes open source, open formats, open interfaces, and open cloud, a preferred approach for companies that want to avoid proprietary formats and technology lock-in that come with traditional data warehouses. The offering is aimed at improving joint customers’ experience of running Presto in the cloud to help power the next generation of analytics use cases.

“We look forward to working with Ahana and helping bring this compelling open data lake analytic solution to market,” said Arijit Bandyopadhyay, CTO of enterprise analytics & AI within Intel’s data platform group. “As more companies require data to be queried across many different data sources like Amazon S3, Amazon Redshift and Amazon RDS, Presto will become even more mission critical. Intel Optane coupled with the Ahana Cloud platform provides superior analytical performance and ease of use for Presto, enabling data-driven companies to query data in place for open data lake analytics.”

Availability and Pricing
Ahana Cloud for Presto is available in AWS Marketplace, with support for Microsoft Azure and Google Cloud to be added in the future. Ahana Cloud for Presto is elastically priced based on usage, with PAYGO and annual options via AWS Marketplace, starting from $0.25 per Ahana cloud credit. 

The Ahana/Intel Open Data Lakes Accelerator Package is available today via an AWS Marketplace Private Offer.

Supporting Resources

Tweet this: .@AhanaIO announces GA of #ManagedService for #Presto; Joins Intel Disruptor Program #cloudnative #opensource #analytics @prestodb https://bit.ly/2LiE6cL

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

How do I show tables from all schemas with Presto?

In Presto it is straightforward to show all tables in a schema. For example, if we have a MySQL data source/catalog with a “demo” schema, we use show tables in mysql.demo; but this only lists the tables in that one schema.

There is no equivalent single command to show all tables in all schemas for a data source. However, there is the metastore to fall back on, which we can query: in MySQL, Glue, Hive and others there is a schema called “information_schema” which contains a table called “tables”, and this maintains the list of schemas and tables belonging to that data source.

For a MySQL data source:

presto> select table_schema, table_name from mysql.information_schema.tables order by 1,2;
    table_schema    |                  table_name                   
--------------------+-----------------------------------------------
 demo               | geography                                     
 demo               | state                                         
 demo               | test                                          
 information_schema | applicable_roles                              
 information_schema | columns                                       
 information_schema | enabled_roles                                 
 information_schema | roles                                         
 information_schema | schemata                                      
 information_schema | table_privileges                              
 information_schema | tables                                        
 information_schema | views                                         
 sys                | host_summary                                  
 sys                | host_summary_by_file_io                       
 sys                | host_summary_by_file_io_type                  
...

For Ahana’s integrated Hive metastore:

presto:demo> select table_schema, table_name from ahana_hive.information_schema.tables order by 1,2;
    table_schema    |        table_name         
--------------------+---------------------------
 csv_test           | yellow_taxi_trips         
 csv_test           | yellow_taxi_trips_orc     
 csv_test           | yellow_taxi_trips_staging 
 information_schema | applicable_roles          
 information_schema | columns
...    

This should help you show tables from all schemas.

How do I convert Bigint to Timestamp with Presto?

Presto handles UNIX timestamps as double values rather than integers. If you have UNIX timestamps stored as big integers, you may encounter errors when trying to cast them directly to timestamps:

presto> select col1 from table_a;
     col1      
------------
 1606485526 
 1606485575 
 
presto> select cast(col1 as timestamp) from table_a;
Query 20201127_150824_00052_xnxra failed: line 1:8: Cannot cast bigint to timestamp

There is a solution!  Presto’s from_unixtime() function takes a UNIX timestamp and returns a timestamp:

presto> select col1,from_unixtime(col1) as ts from table_a;
    col1    |          ts          
------------+-------------------------
 1606485526 | 2020-11-27 13:58:46.000 
 1606485575 | 2020-11-27 13:59:35.000 

And we can optionally modify the format of the result by using date_format():

presto> select date_format(from_unixtime(col1),'%Y-%m-%d %h:%i%p') from table_a;
        _col0        
---------------------
 2020-11-27 01:58PM 
 2020-11-27 01:59PM 

That’s how to use from_unixtime() to convert a bigint to timestamp. 

How do I convert timestamp to date with Presto?

Luckily Presto has a wide range of conversion functions and they are listed in the docs.  Many of these allow us to specifically convert a timestamp type to a date type.

To test this out we can use Presto’s built-in current_timestamp function (an alias for the now() function) that returns the current system time as a timestamp:

presto> select current_timestamp as "Date & Time Here Now";
         Date & Time Here Now          
---------------------------------------
 2020-11-27 13:20:04.093 Europe/London 
(1 row)

To grab the date part of a timestamp we can simply cast to a date:

presto> select cast(current_timestamp as date) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

Or we can use date() which is an alias for cast(x as date):

presto> select date(current_timestamp) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

We can use date_format() which is one of Presto’s MySQL-compatible functions: 

presto:demo> select date_format(current_timestamp, '%Y%m%d') as "Today's date";
 Today's date  
----------
 20201127 
(1 row)

Finally we can use format_datetime() which uses a format string compatible with JodaTime’s DateTimeFormat pattern format:

presto:demo> select format_datetime(current_timestamp, 'Y-M-d') as "Date";
  Date  
----------
 2020-11-27 
(1 row)

The above 5 examples should allow you to convert timestamps to dates in any scenario.

How do I configure Case Sensitive Search with Presto?

When dealing with character data, case sensitivity can be important when  searching for specific matches or patterns. But not all databases and query engines behave in the same way. Some are case insensitive by default, some are not. How do we configure things so they behave in the way we want?

Here’s an example of why we might need to take steps to control case sensitivity. We’re accessing a MySQL database directly:

mysql> select * from state where name='iowa';
+------+----+--------------+
| name | id | abbreviation |
+------+----+--------------+
| Iowa | 19 | IA           |
+------+----+--------------+
1 row in set (0.00 sec)

MySQL is case-insensitive by default. Even though the MySQL column contains the capitalized string ‘Iowa’ it still matched the query’s restriction of ‘iowa’.  This may be acceptable, but in some use cases it could lead to unexpected results.

Using Presto to access the same MySQL data source things behave differently, and arguably, in a more expected way:

presto:demo> select * from state where name='iowa';
 name | id | abbreviation 
------+----+--------------
(0 rows)
 
Query 20201120_151345_00001_wjx6r, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:07 [1 rows, 0B] [0 rows/s, 0B/s]
 
presto:demo> select * from state where name='Iowa';
 name | id | abbreviation 
------+----+--------------
 Iowa | 19 | IA           
(1 row)

Now we only get a match with ‘Iowa’, and not with ‘iowa’. Presto has essentially made this data source (MySQL) case sensitive, even though it is exactly the same database in both the above examples, with default configurations used.

Reconfigure Case Sensitivity 

With a RDBMS like MySQL you can configure the collation setting to control if you want case sensitivity or not.  You can set the collation at the database creation or table creation level as a part of the CREATE statement. Or you can use ALTER to change the collation of a database, table or individual column. This is described in MySQL’s documentation. 

But how do you change Presto to be case-insensitive? Presto does not support collation, because it is not a database and doesn’t store data. And there is no configuration parameter that controls this.  To manage case sensitivity in Presto, and mimic collation, we rewrite the query to force case insensitivity explicitly by using:

  • simple lower() or upper() functions

Example:

select * from state where lower(name)='california';
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

This query has matched any upper/lower case combination in the table, mimicking case insensitivity.

Or regular expressions. 

Example:

select * from state where regexp_like(name, '(?i)california');
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

The regular expression syntax (?i) means matches are case insensitive. 

When it comes to Case Sensitive Search Configuration you are now an eXpErT.

When should I use ORC versus Parquet when using Presto?

If you’re working with open data lakes using open source and open formats, you can have multiple formats. Presto works with both. You’ll probably want to optimize for your workloads. 

Both ORC and Parquet store data in columns. Parquet is often the most efficient when it comes to storage and performance, while ORC is well suited to storing compact data and skipping over irrelevant data without complex or manually maintained indices. For example, ORC is typically better suited to dimension tables, which are somewhat smaller, while Parquet works better for fact tables, which are much bigger.
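Because the storage format is just a table property in Presto’s Hive connector, it is easy to test both against your own workload. A minimal sketch, assuming a hypothetical sales_raw source table:

-- Write the same data once as ORC and once as Parquet, then compare
-- storage footprint and query times for your particular queries.
CREATE TABLE ahana_hive.demo.sales_orc
WITH (format = 'ORC')
AS SELECT * FROM ahana_hive.demo.sales_raw;

CREATE TABLE ahana_hive.demo.sales_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM ahana_hive.demo.sales_raw;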

What’s the advantage of having your own Hive metastore with Presto? How does it compare to Amazon Glue?

First let’s define what Apache Hive is versus Amazon Glue. Apache Hive reads, writes, and manages large datasets using SQL. Hive was built for Hadoop. AWS Glue is a fully managed ETL service for preparing and loading data for analytics. It automates ETL and handles the schemas and transformations. AWS Glue is serverless, so there’s no infrastructure needed to provision or manage it (you only pay for the resources used while your jobs are running).

Presto isn’t a database and does not come with a built-in catalog, so you’d want to use a metastore like Hive to manage the metadata for your datasets. Presto sits on top of a catalog such as Hive, and you can use the Glue catalog as the default Hive metastore for Presto.
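Whichever metastore backs it, the catalog simply shows up as a prefix in your queries. A quick sketch – the catalog, schema and table names here are illustrative:

-- List the catalogs this cluster knows about, then drill into one of them.
SHOW CATALOGS;
SHOW SCHEMAS FROM hive;
SELECT count(*) FROM hive.demo.orders;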

With Ahana Cloud, you don’t really need to worry about integrating Hive and/or AWS Glue with Presto. Presto clusters created with Ahana come with a managed Hive metastore and a pre-integrated Amazon S3 data lake bucket. Ahana takes care of connecting external catalogs like Hive and Amazon Glue, so you can focus more on analytics and less on integrating your catalogs manually. You can also create managed tables as opposed to external tables.

How do you find out data type of value with Presto?

Presto has a typeof() function to make finding the data type of a value easy. This is particularly useful when you are getting values from nested maps for example and the data types need to be determined.

Here’s a simple example showing the type returned by Presto’s now() function:

presto:default> select now() as System_Timestamp, typeof( now() ) as "the datatype is";
           System_Timestamp            |     the datatype is      
---------------------------------------+--------------------------
 2020-11-18 15:15:09.872 Europe/London | timestamp with time zone 

Some more examples:

presto:default> select typeof( 'abc' ) as "the datatype is";
 the datatype is 
-----------------
 varchar(3)      
 
presto:default> select typeof( 42 ) as "the datatype is";
 the datatype is 
-----------------
 integer         
 
presto:default> select typeof( 9999999999 ) as "the datatype is";
 the datatype is 
-----------------
 bigint          
 
presto:default> select typeof( 3.14159 ) as "the datatype is";
 the datatype is 
-----------------
 decimal(6,5)

Armed with this info you should now be able to find out the data types of values.

How do you rotate rows to columns with Presto?

Sometimes called pivoting, here is one example of how to rotate row data with Presto.  

Suppose we have rows of data like this:

'a', 9
'b', 8
'a', 7 

We want to pivot this data so that all the ‘a’ row values are arranged in one column, and all the ‘b’ row values are in a second column like this:

 a | b
---+---
 9 |
   | 8
 7 |

To rotate from rows to columns we will add an id to make aggregation easy. We will name the output columns a and b, and we’ll include the id in our result set. This is how we do the rotation in Presto, using VALUES() to supply the test data, and simple conditional CASE WHEN END logic:

presto:default> SELECT id
, MAX(CASE WHEN key = 'a' THEN value END) AS a
, MAX(CASE WHEN key = 'b' THEN value END) AS b 
FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7 )) as test_data (id, key, value) 
GROUP BY id ORDER BY id;
 
 id |  a   |  b   
----+------+------
  1 |    9 | NULL 
  2 | NULL |    8 
  3 |    7 | NULL 
(3 rows)

There are other SQL options for transforming (pivoting) rows into columns – you can use the map_agg function for example.
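Here is a sketch of the same pivot written with map_agg over the test data above (the column aliases are our own):

SELECT id,
       element_at(kv, 'a') AS a,
       element_at(kv, 'b') AS b
FROM (
    -- collapse each id's key/value pairs into a single map
    SELECT id, map_agg(key, value) AS kv
    FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7)) AS test_data (id, key, value)
    GROUP BY id
)
ORDER BY id;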

The code sample and description here should help when you need to rotate data from rows to columns using Presto.

If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

How do you rotate columns to rows with Presto?

Sometimes called unpivoting, here is one example of how to rotate column data with Presto.  

Suppose we have some integer data arranged in two columns called a and b:

 a | b
---+---
 9 |
   | 8
 7 |

We want to rotate the columns into rows like this, where for every ‘a’ column value we now see a row labeled ‘a’, and ditto for the b values:

'a', 9
'b', 8
'a', 7

To rotate from columns to rows in Presto we’ll use a CTE and VALUES() to supply the test data, and simple conditional CASE WHEN END logic coupled with a sub-select and a UNION:

presto:default> with testdata(value_a, value_b) as ( VALUES (9,null), (null,8), (7,null) ) 
select key, value from 
(select 'a' as key, value_a as value 
from testdata 
UNION select 'b' as key, value_b as value 
from testdata) 
where value is not null;
 
 key | value 
-----+-------
 a   |     9 
 b   |     8 
 a   |     7 
(3 rows)

There are other SQL options for rotating (unpivoting) columns into rows: the CROSS JOIN UNNEST construct, for example, is similar to Hive’s LATERAL VIEW explode function.
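For completeness, here is a sketch of the same unpivot written with CROSS JOIN UNNEST over the test data above:

WITH testdata(value_a, value_b) AS ( VALUES (9, null), (null, 8), (7, null) )
SELECT t.key, t.value
FROM testdata
CROSS JOIN UNNEST(
    ARRAY['a', 'b'],             -- the new "key" column
    ARRAY[value_a, value_b]      -- the values taken from columns a and b
) AS t (key, value)
WHERE t.value IS NOT NULL;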

The code sample and description here should help when you need to rotate data from columns to rows using Presto.

What are the operational benefits of using a managed service for Presto with Ahana Cloud?

First let’s hear from an AWS Solution Architect: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service, the customer of Ahana does not have to do a lot of work, everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service. The customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC. So you, as the customer, have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have that context, let’s get more specific. Say you want to create a cluster: it’s just a couple of clicks with Ahana Cloud. You can pick the coordinator instance type and the Hive metastore instance type – it is all flexible. Instead of using the Ahana-provided Hive metastore, you can bring your own Amazon Glue catalog. Then of course it’s easy to add data sources; for that, you can add in JDBC endpoints for your databases. Ahana has those integrations built in, and Ahana Cloud automatically restarts the cluster.

Compared to EMR or if you’re running with other distributions, all of this has to be done manually:

  • you have to create a catalog properties file for each data source
  • restart the cluster on your own
  • scale the cluster manually
  • add your own query logs and statistics
  • rebuild everything when you stop and restart clusters

With Ahana, all of this manual complexity is taken away. For scaling up, if you want to grow the analytics jobs over time, you can add nodes seamlessly. Ahana Cloud and other distributions can add the nodes to the cluster while your services are still up and running. But the part that isn’t seamless is when you stop the entire cluster. In addition to all the workers and the coordinator being provisioned, the configuration and the cluster connections to the data sources, and the Hive metastore are all maintained with Ahana Cloud. And so when you restart the cluster back up, all of that comes up pre-integrated with the click of a button: the nodes get provisioned again, and you have access to that same cluster to continue your analytics service. This is very important, because otherwise, you would have to manage it on your own, including the configuration management and reconfiguration of the catalog services. Specifically for EMR, for example, when you terminate a cluster, you lose track of that cluster altogether. You have to start from scratch and reintegrate the whole system.

How do you do a Lateral View Explode in Presto?

Hive’s explode() function takes an array (or a map) as input and outputs the elements of the array (map) as separate rows. Explode is a built-in Table-Generating Function (UDTF) in hive and can be used in a SELECT expression list and as a part of LATERAL VIEW.

The explode function doesn’t exist in Presto; instead we can use Presto’s similar UNNEST. 

Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

This code sample and description should help when you need to to do a lateral view explode in Presto.

If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

How do you cross join unnest a JSON array?

Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

The UNNEST approach is similar to Hive’s explode function.

This code sample and description should help when you need to execute a cross join to unnest a JSON array. 

How can you write the output of queries to S3 easily?

With Ahana Cloud, we’ve made it easy for you to write the output of queries to S3. A variety of formats are supported; here’s an example:

presto> CREATE SCHEMA ahana_hive.s3_write WITH (location = 's3a://parquet-test-bucket/');
CREATE SCHEMA

presto> CREATE TABLE ahana_hive.s3_write.my_table
WITH (format = 'PARQUET')
AS SELECT <your query here> ;

Does Amazon Athena do joins across other data sources besides S3? Does Amazon Athena connect to other data sources?

With Amazon Athena you’re limited in scope when it comes to doing joins across other data sources like relational database systems. You have to set up a Lambda function, which then connects to your database behind the scenes – an additional piece you have to manage on your own.


With Ahana Cloud, you can directly connect to many different types of data sources including S3, MySQL, PostgreSQL, Redshift, Elastic, and more with a few clicks, and you’re ready to integrate with any cluster.

If I have catalogs connected and configurations attached to my Presto cluster, what happens when I take the cluster down?

If you’re managing Presto on your own, either through your own installation or through a service like AWS EMR or AWS Athena, you have to maintain and manage all of the catalogs and configurations attached to your cluster. That means that if you take your cluster down, you’ll lose those catalogs and configurations – they are not maintained for your cluster.


You can use the Ahana Cloud managed service for Presto to help with this. Ahana Cloud manages all of that for you, so you don’t have to worry about losing catalogs and configurations attached to your Presto cluster. You also get Presto bundled with data sources like the Hive metastore, Apache Superset, and more.

Optimize Presto EMR

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) simplifies running big data and analytics frameworks like Presto for scalable compute in the cloud. It provides on-demand, scalable Hadoop clusters for processing large data sets. You can move large volumes of data into and out of AWS datastores like S3 with Amazon EMR. AWS EMR uses Amazon EC2 instances for fast provisioning, scalability and high availability of compute power. 

With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premises Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.

As a Hadoop distribution, AWS EMR incorporates various Hadoop tools, including Presto, Spark and Hive, so that users can query and analyze their data. With AWS EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on each cluster instance for the lifetime of the cluster. In order to persist data stored in HDFS, it must be manually copied to S3 before the cluster is terminated.

What is Presto?

Presto is an open source, federated SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. It runs on a cluster of machines and enables interactive, ad hoc analytics on large amounts of data. 

Presto enables querying data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. Presto allows users to access data from multiple sources, allowing for analytics across an entire organization.

Using Presto on Amazon EMR

With Presto and AWS EMR, users can run interactive queries on large data sets with minimal setup time. AWS EMR handles the provisioning, configuration and tuning of Hadoop clusters. Provided you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included automatically as part of the cluster software. Earlier versions of AWS EMR include Presto as a sandbox application.

AWS EMR And Presto Configurations

As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In AWS EMR, data is either copied to HDFS on each cluster instance or read from S3. From EMR 5.12.0 onwards, Presto uses EMRFS by default to connect to Amazon S3. EMRFS extends the HDFS API to S3, giving Hadoop applications like Presto access to data stored in S3 without additional configuration or copying of the data. For earlier versions of AWS EMR, data in S3 can be accessed using Presto’s Hive connector.
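Either way, data already sitting in S3 is typically exposed to Presto as an external table through the Hive connector. A minimal sketch, assuming a hypothetical bucket, schema and column layout:

-- Register existing Parquet files in S3 as a queryable external table.
CREATE TABLE hive.default.events (
    event_id   bigint,
    event_time timestamp,
    payload    varchar
)
WITH (
    external_location = 's3://my-example-bucket/events/',
    format = 'PARQUET'
);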

Real world applications

Jampp is a mobile app marketing platform that uses programmatic ads to acquire new users and retarget those users with relevant ads. It sits between advertisers and their audiences, so real-time bidding on media advertising space is critical for their business. The amount of data Jampp generates as part of the bidding cycle is massive – 1.7B events tracked per day, 550K requests received per second, and 100TB of data processed by AWS Elastic Load Balancers per day. PrestoDB plays a critical role in their data infrastructure: Jampp relies on Presto on AWS EMR for ad hoc queries and performs over 3,000 ad hoc queries per day on over 600TB of queryable data.

Ahana Welcomes Database Pioneer David E. Simmen to Executive Team as Chief Technology Officer

Former Splunk Fellow and Chief Architect & Teradata Fellow brings 30+ years of database expertise to lead technology innovation

San Mateo, CA — July 23, 2020 — Ahana, the SQL analytics company for Presto, today announced the appointment of David E. Simmen as Chief Technology Officer. Simmen will oversee the company’s technology strategy while driving product innovation. His extensive database experience will further accelerate Ahana’s vision of simplifying ad hoc analytics for organizations of all shapes and sizes.

“Dave has a remarkable record of innovation in distributed databases, data federation and advanced analytics. That, coupled with his over 30 years of architecting and building in the field, is nothing short of impressive,” said Steven Mih, Co-founder & CEO of Ahana. “His experience innovating distributed database technology with Splunk, Teradata and IBM provides Ahana the depth needed for truly groundbreaking development to achieve our vision of unified, federated analytics. I couldn’t be more thrilled to have Dave at the helm as CTO and Co-founder.”

Simmen joined Ahana most recently from Apple where he engineered iCloud database services. Prior to Apple, he was Chief Architect with Splunk and named the first Fellow in the company’s history. There, he built Splunk’s multi-year technical roadmap and architecture in support of enterprise customers. Prior to Splunk, Simmen was Engineering Fellow and CTO of Teradata Aster where he set the technical direction and led the architecture team that built advanced analytics engines and storage capabilities in support of big data discovery applications.

Earlier in his career, Simmen was a Senior Technical Staff Member (STSM) at IBM Research, adding innovations to DB2 such as the IBM Starburst SQL compiler, support for MPP and SMP, database federation, and many others. Simmen is also a named inventor on 37 U.S. patents and has 15 publications to his name, including Fundamental Techniques for Order Optimization, Robust Query Processing through Progressive Optimization, and Accelerating Big Data Analytics With Collaborative Planning.

“Given the spread of data across data lakes, data warehouses, and the plethora of relational and non-relational data services being offered today, the need for unified, federated analytics is only increasing. Presto is one of the most popular federated query engines but it’s missing fundamental optimizations developed over the decades by academic researchers and industry pioneers,” said Simmen. “That needs to be applied to Presto, making it even more powerful to eventually become the heart of the unified, federated analytical stack. I look forward to working with the Ahana team to turn this vision into reality.”

“In my time with Dave at IBM’s Almaden Research Center and then with the DB2 optimizer team at Silicon Valley Laboratory, I saw firsthand his technical expertise and his broader leadership skills,” said Dr. Laura Haas, Dean of College of Information and Computer Sciences at UMass Amherst and retired IBM Fellow. Dean Haas is best known for her work on the Starburst query processor and Garlic, a federated engine. 

Dean Haas continued, “His combined leadership and technical depth is unparalleled, and it’s wonderful that he has joined Ahana to drive innovation for the next generation of data federation technology for analytics. I look forward to seeing his vision come to life in this space.”

Resources:

Download a headshot of David E. Simmen.

Links to David Simmen’s published research papers 

Tweet this:  .@AhanaIO welcomes Database Pioneer David E. Simmen to Executive Team as #CTO #database #opensource #analytics #cloud bit.ly/30xmwWr

About Ahana

Ahana, the SQL analytics company for Presto, is focused on evangelizing the Presto community and bringing simplified ad hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to enable unified, federated analytics. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn and Twitter.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana Announces Linux Foundation’s PrestoDB Now Available on AWS Marketplace and DockerHub

Ahana to provide community and commercial support for PrestoDB users 

San Mateo, CA — June 30, 2020 — Ahana, the Presto-based analytics company, announced today the availability of the completely open source PrestoDB Amazon Machine Image (AMI) on the AWS Marketplace as well as a PrestoDB container on DockerHub. The PrestoDB AMI is the first and only completely open source and completely free Presto offering available on the AWS Marketplace. These free PrestoDB offerings make it significantly easier for data platform teams to get started with Presto in the cloud, particularly for interactive, ad hoc analytics on S3 data lakes and other popular data sources like AWS RDS, Redshift, Amazon’s Elasticsearch service and others.

Additionally, Ahana will provide commercial support for users who require technical help and management of their container or AMI-based PrestoDB clusters.

PrestoDB is a federated SQL engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. As a result, data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place, without the need to move data. One of the fastest growing projects in the data analytics space, PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter. 

“We’re looking forward to making it easier for the Presto community to get started with Presto,” said Dipti Borkar, co-founder and Chief Product Officer at Ahana and Outreach Committee Chairwoman of the Presto Foundation. “This is the first and only fully open source version of Presto that’s available on the AWS Marketplace, and we hope to see continued growth and adoption of PrestoDB as the federated query engine of choice for analytics across data lakes and databases.”

In addition to the PrestoDB AMI and container, new Presto Sandbox offerings are also available to users getting started with Presto. These include a new Sandbox AMI on the AWS Marketplace and a Sandbox container on DockerHub. They come preconfigured with the Hive Metastore catalog to allow users to query data easily from S3, MySQL and other data sources as well as with built-in data sets like the TPC-DS benchmark data.

The new AWS AMIs and container offerings are yet another way for PrestoDB users to install and get started with the software. Other options include manual installation and using Homebrew. All variations offer PrestoDB as a completely open source technology.

“Having tracked the data space for more than two decades, I’ve seen wave after wave of innovation. What Presto brings to the table is something we’ve long sought: an efficient engine for federated queries across widely disparate datasets,” noted Eric Kavanagh, CEO of The Bloor Group. “This allows organizations to retain their investments in databases, data warehouses, data lakes, and so-called lake houses, while expediting and amplifying business value. The approach Ahana is taking around making PrestoDB easier to use and providing commercial support are key pieces in helping widen the adoption of the technology.” 

Two Presto offerings, one name

Despite similar names, PrestoDB and PrestoSQL are two different offerings. While other variations of Presto are available on the AWS Marketplace, like the Starburst Data AMI (based on PrestoSQL), they are paid offerings with proprietary features. The PrestoDB AMI, on the other hand, is 100% open source and available for use in production immediately.

As the original project that came out of Facebook in 2013, PrestoDB is hosted under the auspices of the Linux Foundation’s Presto Foundation. The PrestoSQL fork was announced in 2019 and is backed by the Presto Software Foundation, led by the original creators of Presto who left Facebook.

Resources

Tweet this:  .@AhanaIO announces @prestodb now available on @awsmarketplace; Ahana will provide support #opensource #analytics #cloud https://bit.ly/2BGPudo

About Ahana

Ahana, the Presto-based analytics company, is the only company focused on unifying the PrestoDB community and bringing PrestoDB-focused ad-hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

bethwinkowski@gmail.com