How Does Presto Join Data?

Presto is a distributed system composed of a coordinator and workers, and each worker can connect to one or more data sources through the corresponding connectors.

The coordinator receives the query from the client, then optimises and plans its execution, breaking it down into constituent parts to produce the most efficient execution steps. The execution steps are sent to the workers, which then use the connectors to submit tasks to the data sources. The tasks could be file reads or SQL statements, and they are optimised for each data source and the way it organises its data, taking into account partitioning and indexing, for example.
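To see how the coordinator breaks a query into distributed execution steps, you can ask Presto for its distributed plan. Here is a minimal sketch; the mysql.demo.orders catalog, schema and table are hypothetical names used only for illustration:

presto> EXPLAIN (TYPE DISTRIBUTED)
        SELECT customer_id, count(*) AS order_count
        FROM mysql.demo.orders
        GROUP BY customer_id;

The output lists the plan fragments – table scans pushed to the connector, partial and final aggregations, and the exchanges between workers.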

The data sources supported by Presto are numerous: an RDBMS, a NoSQL database, or Parquet/ORC files in an object store like S3, for example. The data sources execute the low-level queries by scanning, filtering, partition pruning and so on, and return the results to the Presto workers. The join operation (and other processing) is performed by the workers on the received data, the results are consolidated, and the joined result set is returned to the coordinator.

You will notice Presto uses a “push model”, which differs from, for example, Hive’s “pull model”. Presto pushes execution steps to the data sources, so some processing happens at the source and some happens in Presto’s workers. The workers also communicate with each other, and processing takes place in memory, which makes it very efficient and well suited to interactive queries. Hive, on the other hand, will read/pull a block of a data file, execute tasks, then wait for the next block, using the MapReduce framework. Hive’s approach is not suitable for interactive queries since it reads raw data from disk and stores intermediate data to disk, all using MapReduce, which is better suited to long-running batch processing. This diagram compares Hive’s and Presto’s execution approaches:

How Presto Joins Relational and Non-Relational Sources

The next diagram shows some of Presto’s core coordinator components, and the kinds of tasks Presto’s workers handle. In this simplistic example there are two data sources being accessed: one worker is scanning a Hive data source, the other worker is scanning a MongoDB data source. Remember Presto does not use Hive’s MapReduce query engine or HQL – the diagram’s “hive” worker means it is using the Hive connector, the file system holds the metastore information, and the raw source data is external to Presto, perhaps in HDFS in Parquet or ORC format. The worker dealing with the MongoDB data is described as being on the “probe side” (in this example), whereby the MongoDB data is read, processed, normalized into columnar form in Presto and then shuffled (or exchanged) across the cluster to the “build side” worker (the worker dealing with the Hive data here) for the actual join to take place.

This is a simplistic example; in reality Presto is more sophisticated – the join operation could be running in parallel across multiple workers, with a final stage running on one node (since it cannot be parallelized). This final stage is represented by the third worker at the top of the diagram, labeled “Output”. This output node’s task is to stream the result set back to the coordinator, and from there back to the client.

Joins – Summary

It’s easy to see why Presto is called the “Polyglot Data Access Layer”: it doesn’t matter where your data lives, any query can access any data, in place, without ETL, data shipping or duplication. It is true federation. Even when blending very different sources of data, like JSON data in Elasticsearch or MongoDB with tables in a MySQL RDBMS, Presto takes care of the flattening and processing to provide a complete, unified view of your data corpus.
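As an illustrative sketch of such a federated query (the catalog, schema and table names below are hypothetical), a single SELECT can join a MySQL table to a MongoDB collection in place:

presto> SELECT c.name, sum(o.total) AS lifetime_value
        FROM mysql.demo.customers c        -- relational source
        JOIN mongodb.demo.orders o         -- document source, flattened by the connector
          ON c.customer_id = o.customer_id
        GROUP BY c.name
        ORDER BY lifetime_value DESC;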

If you want to try out Presto, take a look at Ahana Cloud. It provides a managed service for Presto in AWS.

When I run a query with AWS Athena, I get the error message ‘query exhausted resources on this scale factor’. Why?

AWS Athena is well documented as having performance issues, both in terms of unpredictability and speed. Many users have pointed out that even relatively lightweight queries on Athena will fail. Part of the issue may be how many columns the user has in the GROUP BY clause – even a small number of columns (fewer than five) can run into this issue of not having enough resources to complete. Other times it may be due to how much data is being parsed, and again even small amounts of data (less than 200MB) can run into the same issue.

Presto stores GROUP BY columns in memory while it works to match rows with the same grouping key. The more columns there are in the GROUP BY clause, the fewer rows get collapsed by the aggregation and the more memory is needed. To address this problem, reduce the number of columns in the GROUP BY clause and retry the query.
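For example (hypothetical table and column names, purely for illustration), a wide grouping can often be narrowed to just the columns the report actually needs:

-- Memory-hungry: six grouping columns, so few rows collapse per group
SELECT col1, col2, col3, col4, col5, col6, count(*) AS cnt
FROM my_table
GROUP BY col1, col2, col3, col4, col5, col6;

-- Lighter: fewer grouping columns, more rows collapse per group
SELECT col1, col2, count(*) AS cnt
FROM my_table
GROUP BY col1, col2;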

And at still other times, the issue may not be how long the query takes but whether the query runs at all. Users who experience “internal errors” on queries one hour will re-run the same queries later and they will succeed.

Ultimately, AWS Athena is not predictable when it comes to query performance. For those who want to take advantage of Presto and get consistent and predictable query performance you can control, Ahana Cloud provides a managed service for Presto that runs in AWS.

I use RDS Postgres databases and need some complex queries done which tend to slow down my databases for everyone else on the system. What do I need to consider as I add a data lake for the analytics?

Background

Many medium-sized companies start out using one of the six flavors of Amazon RDS: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. As they grow they can end up with numerous instances of RDS and numerous databases for each customer, whether internal or external. When one customer tries a large analytic query, that workload can cause problems for the RDS cluster, perhaps making it drop other workloads, fail, or slow down for everyone else. As the need to process huge amounts of data increases, so does the need to take your analytics to the next level.

In addition to your operational databases, the idea is to have a much more open analytics stack where you have the ability to run different kinds of processing on the same data. A modern analytics stack lets your organization gain far more insight without impacting the operational side. Doing it with open data formats is another key consideration.

Considerations

There are a couple of options for evolving your analytics stack. One would be to use a cloud data warehouse like Amazon Redshift or Snowflake. Another would be to use open formats in a data lake with a modern SQL query engine. In the first case, you get some of the highest performance possible on your data, but it comes at a cost and with a degree of lock-in, since you cannot easily get at data held in proprietary formats. Considering the data lake with query engine option, we believe Presto is one of the best choices because of its performance, scalability, and flexibility to connect to S3-based data lakes and federate other data sources as well.

So our recommendation would be to run Presto on top of data stored in an open Parquet or ORC format in S3. Doing it this way, you can put other engines on top of the data as needed, so you’re not going to face a lot of rework in the future should you decide to change something.

From OLTP to Data Lake Analytics

The high-level concept is an initial, one-time bulk migration of the data in your OLTP databases to get a copy into S3. After that, as your operational databases continue to generate or change data, you’ll need to establish a pipeline, a stream, or a Change Data Capture (CDC) process to get those changes into S3. Note that you will rarely want data flowing back from S3 into your relational databases.

While there are different ways to pipe data into S3, one AWS-recommended approach is to use the AWS Database Migration Service, a.k.a. DMS (much the same way you may have used it when you migrated off-prem). With AWS DMS, you can do the first one-time bulk load and then continuously replicate your data with high availability, streaming it to Amazon S3. AWS DMS runs in the background and handles all of the data changes for you. You can pick the instance and the period you’d like it to run on; for example, you may want hourly or daily partitions. That will depend on how fast your data is changing and on your analytics requirements.

Next you’ll want to install Presto on top, and for that you can build a Presto cluster yourself, or simply use Ahana Cloud for Presto to create pre-configured clusters in about 30 minutes.

It’s also worth noting that after you’ve ingested the data into S3 in what you think is the most optimized format or folder structure, you may find out that you need it structured differently. In that case, not to worry: you can use Presto itself to do data lake transformations, using SQL with a CTAS, Create Table As Select:

The CREATE TABLE AS SELECT (CTAS) statement is one of the most important SQL features available. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest and fastest way to create and insert data into a table with a single command.
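Here is a minimal sketch of such a CTAS transformation in Presto; the schema, table and bucket names are placeholders for illustration:

CREATE TABLE ahana_hive.analytics.orders_orc
WITH (
    format = 'ORC',                                     -- rewrite the data as ORC
    external_location = 's3a://my-bucket/orders_orc/'   -- hypothetical S3 location
)
AS
SELECT order_id, customer_id, order_date, total
FROM ahana_hive.raw.orders_csv;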

Summary

By adding a modern analytics stack to your operational databases, you evolve your analytics capabilities and deliver more insights for better decisions. We suggest moving to an open data lake analytics reference architecture with Presto. This allows a meta-analysis of all your data, giving a look at broader trends across databases and other data sources.

Ahana Cofounders Make Data Predictions for 2021

Open Analytics, Open Source, Multi-Cloud and Federated, Disaggregated Stack Rise to the Top 

San Mateo, Calif. – January 6, 2021 – Ahana’s Cofounder and Chief Product Officer, Dipti Borkar, and Cofounder and Chief Technology Officer, Dave Simmen, predict major developments in cloud, data analytics, databases and data warehousing in 2021.

As the shift to the cloud and multi-cloud environments has become even greater during the past year hastened by the challenges of the COVID-19 pandemic, new challenges have arisen when it comes to managing data and workloads. Companies want to keep their data secure in their own accounts and environments but still leverage the cloud for analytics. They want quick analytics on their data lakes. They also want to take advantage of containers for their applications across different cloud environments. 

Dipti Borkar, Co-founder and Chief Product Officer, outlines the major trends she sees on the horizon in 2021:

  • Open Source for Analytics & AI – Next year will see a rise in usage of analytic engines like Presto and Apache Spark for AI applications because of their open nature – open source license, open format, open interfaces, and open cloud. 
  • Open Source for Open Analytics – More technology companies will adopt an open source approach for analytics compared to the proprietary formats and technology lock-in that came with traditional data warehousing. This open analytics stack uses open source Presto as the core engine; open formats such as JSON, Apache ORC, Apache Parquet and others; open interfaces such as standard JDBC / ODBC drivers to connect to any reporting / dashboarding / notebook tool and ANSI SQL compatibility; and is open cloud. 
  • Containerizing Workloads for Multi-Cloud Environments – As the importance of a multi-cloud approach has gained traction over the past year, next year more companies will run container workloads in their multi-cloud environments. To do that, they’ll need their compute processing to be multi-cloud ready and containerized out of the box, so choosing a compute framework will become even more critical for these workloads. Engines like Presto, which are multi-cloud ready and container friendly, will become the core engine of these multi-cloud containerized workloads.
  • Less Complexity, More Kubernetes-ity for SaaS – Containers provide scalability, portability, extensibility and availability advantages, but managing them is not seamless and, in fact, is often a headache. Kubernetes takes that pain away for building, delivering and scaling containerized apps. 2021 will bring more managed SaaS apps running on K8s, and those that are able to abstract the complexities of their platforms from users will emerge as the winners.
  • The New “In-VPC” Deployment Model – As cloud adoption has become mainstream, companies are creating and storing the majority of their data in the cloud, especially in cost-efficient Amazon S3-based data lakes. To address data security concerns, these companies want to remain in their own Virtual Private Cloud (VPC). As a result, 2021 will bring in a new cloud-native architecture model for data-focused managed services, which I’m calling the “In-VPC” deployment model. It separates the control plane from the compute and data planes for better security and cleaner management.

Dave Simmen, Co-founder and Chief Technology Officer, outlines the major trends he sees on the horizon in 2021:

  • The Next Evolution of Analytics Brings a Federated, Disaggregated Stack – A federated, disaggregated stack that addresses the new realities of data is displacing the traditional data warehouse with its tightly coupled database. The next evolution of analytics foresees that a single database can no longer be the solution to support a wide range of analytics as data will be stored in both data lakes and a range of other databases. SQL analytics will be needed for querying both the data lake and other databases. We’ll see this new disaggregated stack become the dominant standard for analytics with SQL-based technologies like the Presto SQL query engine at the core, surrounded by notebooks like Jupyter and Zeppelin and BI tools like Tableau, PowerBI, and Looker.
  • SQL Is the New…SQL – As companies shift their data infrastructure to a federated (one engine queries different sources), disaggregated (compute is separate from storage is separate from the data lake) stack, we’ll see traditional data warehousing and tightly coupled database architectures relegated to legacy workloads. But one thing will remain the same when it comes to this shift – SQL will continue to be the lingua franca for analytics. Data analysts, data engineers, data scientists and product managers along with their database admins will use SQL for analytics.

Tweet this: @AhanaIO announces 2021 #Data Predictions #cloud #opensource #analytics https://bit.ly/39j6HHl

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What is DB Presto Online?

If you are looking for online resources for Presto, like the docs, then start here: https://prestodb.io/

If you are looking for information on how to tell if your Presto cluster is online or offline then read on.

Presto Mac/Brew Install

If you installed Presto on your Mac using brew then the service is called presto-server and you manage it like this:

$ presto-server run

To tell if your presto server is online:

$ presto-server status
Not Running

Presto Manual Install

If you are running a regular deployment of open source Presto you start/stop/check status with:

$ bin/launcher stop
$ bin/launcher start
$ bin/launcher status
Running as 61824

To be sure your Presto server is online, point your browser at its host and port (e.g. localhost:8080) and you should see the Presto Console’s “Cluster Overview” UI.

Advanced SQL Queries with Presto

Advanced SQL features and functions are used by analysts when, for example, complex calculations are needed, when many tables (perhaps from multiple sources) need to be joined, or when dealing with nested or repeated data, time-series data, or complex data types like maps, arrays, structs and JSON – or perhaps a combination of all of these.

Presto’s ANSI SQL engine supports numerous advanced functions, which can be split into several categories; the PrestoDB documentation describes each of them in detail.

Running advanced SQL queries can benefit greatly from Presto’s distributed, in-memory processing architecture and cost-based optimizer. 
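As a brief illustration of the kind of advanced query this enables (the table and column names are hypothetical), here is a window function computing a trailing moving average over time-series data:

SELECT sensor_id,
       reading_date,
       avg(reading) OVER (
           PARTITION BY sensor_id
           ORDER BY reading_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW    -- trailing 7-reading window
       ) AS moving_avg
FROM ahana_hive.metrics.sensor_readings
ORDER BY sensor_id, reading_date;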

Presto Platform Overview

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Some of the world’s most innovative and data-driven companies like Facebook, Walmart and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte of data each day.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto allows querying data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. Presto is a distributed, parallel, in-memory system. 

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow “free” solution that requires excessive hardware. 

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: Connectors link Presto to a data source such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database (see the example catalog file after this list). 
  • A parser, planner, cost-based query optimizer, scheduler, and execution engine. 
  • Drivers for connecting tools, including JDBC; the presto-cli tool; and the Presto Console. 
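For reference, a connector is configured with a small catalog properties file on each node. A minimal sketch for a MySQL catalog might look like this (the host, user and password are placeholders):

# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://example-host:3306
connection-user=presto_user
connection-password=secret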

In terms of organization, the community-owned and community-driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. The Presto software is released under the Apache License 2.0.

Ahana offers a managed service for Presto in the cloud. You can get started for free today.

0 to Presto in 30 minutes with AWS and Ahana Cloud
Video & Slides

Video
Slides

How To Stop Presto

If you are using the presto-admin tool this is how to stop Presto safely:

$ presto-admin server stop

In addition these commands are also useful:

presto-admin server start
presto-admin server restart
presto-admin server status

If you are using Presto on EMR, you can restart Presto with:

sudo stop presto
sudo start presto

(If you have changed any configuration params you should do this on every node where you made a change).

The above “How To Stop Presto” information is correct for PrestoDB. But other “hard forks” of Presto may use different methods to stop and start. 

Presto New Releases

Where is the latest release of PrestoDB? And where can I find the release notes? Where is the documentation? These are common questions with easy answers. Presto’s web site https://prestodb.io/ and GitHub have all the information you need:

Releases: https://github.com/prestodb for the Presto and Presto-admin repositories.

Release Notes: https://prestodb.io/docs/current/release.html 

Documentation: https://prestodb.io/docs/current/ 

The above list has all the main resources you need for working with Presto’s New Releases. 

How Much Memory Should I Give A Presto Worker Node?

Presto is an in-memory query engine, so naturally memory configuration and management are important.

JVM Memory

Presto’s JVM memory settings nearly always need to be adjusted – you shouldn’t run Presto with its default of 16GB of memory per worker/coordinator.

The -Xmx flag specifies the maximum memory allocation pool for a Java virtual machine. Change the -Xmx16G in the jvm.config file to a number based on your cluster’s capacity and number of nodes. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for how to do this.

Rule of Thumb

It is recommended you set aside 15-20% of total physical memory for the OS. For example, if you are using EC2 “r5.xlarge” instances, which have 32GB of memory, 32GB less 20% is 25.6GB, so you would use -Xmx25G in the jvm.config file for the coordinator and each worker (or -Xmx27G if you want to go with 15% for the OS).

This is assuming there are no other services running on the server/instance, so maximum memory can be given to Presto. 
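Putting that together, a jvm.config for a 32GB node might look like the sketch below. The -Xmx value follows the rule of thumb above; the remaining flags are commonly used defaults rather than required settings:

-server
-Xmx25G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError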

Presto Memory 

As with the JVM settings above, there are two memory-related settings you should check before starting Presto. For most workloads, Presto’s other memory settings will work perfectly well left at their defaults, although there are configurable parameters controlling memory allocation that can be useful for specific workloads. The practical guidelines below will help you decide 1) whether you need to change your Presto memory configuration, and 2) which parameters to change.

Workload Considerations

You may want to change Presto’s memory configuration to optimise for ETL workloads versus analytical workloads, or for high query concurrency versus single-query scenarios. There’s a great in-depth blog on Presto’s memory management, written by one of Presto’s contributors, at https://prestodb.io/blog/2019/08/19/memory-tracking which will guide you in making more detailed tweaks.

Configuration Files & Parameters 

When first deploying Presto there are two memory settings that need checking. Locate the config.properties files for both the coordinator and the workers. The two important parameters are query.max-memory-per-node and query.max-memory. Again, see https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for rules of thumb and how to configure these parameters based on the available memory.
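A sketch of the relevant lines in config.properties is shown below. The values are illustrative only and should be derived from your node size and cluster size as described in the documentation linked above:

# etc/config.properties (illustrative values)
query.max-memory-per-node=10GB
query.max-memory=30GB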

The above guidelines and links should help you decide how much memory to give a worker node. To avoid memory configuration work altogether, we recommend Ahana Cloud for Presto – a fully managed service for Presto that needs zero configuration.

Ahana Announces General Availability of Managed Service for Presto on AWS; Delivers Combined Solution with Intel to Drive Adoption of Open Data Lakes Analytics

San Mateo, Calif. – December 9, 2020 – Ahana, the self-service analytics company for Presto, announced today the General Availability of Ahana Cloud for Presto, the first cloud-native managed service focused on Presto on Amazon Web Services (AWS). Additionally, Ahana announced a go-to-market solution in collaboration with Intel via its participation in the Intel Disruptor Program to offer an Open Data Lake Analytics Accelerator Package for Ahana Cloud users that leverages Intel Optane on the cloud with AWS. 

Ahana Cloud for Presto is the only easy-to-use, cloud-native managed service for Presto and is deployed within the user’s AWS account, giving customers complete control and visibility of clusters and their data. In addition to the platform’s use of Amazon Elastic Kubernetes Services (Amazon EKS) and a Pay-As-You-Go (PAYGO) pricing model in AWS Marketplace, the new release includes enhanced manageability, security and integrations via AWS Marketplace.

 Ahana Cloud for Presto includes: 

  • Easy-to-use Ahana SaaS Console for creation, deployment and management of multiple Presto clusters within a user’s AWS account bringing the compute to user’s data 
  • Support for Amazon Simple Storage Service (Amazon S3), Amazon Relational Database (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL and Amazon Elasticsearch 
  • Click-button integration for user-managed Hive Metastores and Amazon Glue
  • Built-in hosted Hive Metastore that manages metadata for data stored in Amazon S3 data lakes 
  • Pre-integrated and directly queryable Presto query log and integrations with Amazon CloudWatch 
  • Cloud-native, highly scalable and available containerized environment deployed on Amazon EKS

“With Ahana Cloud being generally available, the power of Presto is now accessible to any data team of any size and skill level. By abstracting away the complexities of deployment, configuration and management, platform teams can now deploy ‘self-service’ Presto for open data lake analytics as well as analytics on a range of other data sources,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “Users are looking for analytics without being locked-in to proprietary data warehouses. This offering brings a SaaS open source analytics option to users with Presto at its core, using open formats and open interfaces.”

“As Ahana Cloud users, we saw from day one the value the platform brings to our engineering team,” said Kian Sheik, Data Engineer, ReferralExchange. “Within about an hour we were up and running Presto queries on our data, without having to worry about Presto under the covers. With out-of-the-box integrations with a Hive data catalog and no configurations needed, Ahana Cloud takes care of the intricacies of the system, allowing our team to focus on deriving actionable insights on our data.”

Ahana also announced its participation in the Intel Disruptor Program to drive the adoption of Open Data Lake Analytics. Together, Ahana and Intel will offer an Open Data Lake Analytics Accelerator Package, available for Ahana Cloud users that leverage Intel Optane on AWS. It includes special incentives and PAYGO pricing. An Open Data Lake Analytics approach is a technology stack that includes open source, open formats, open interfaces, and open cloud, a preferred approach for companies that want to avoid proprietary formats and technology lock-in that come with traditional data warehouses. The offering is aimed at improving joint customers’ experience of running Presto in the cloud to help power the next generation of analytics use cases.

“We look forward to working with Ahana and helping bring this compelling open data lake analytic solution to market,” said Arijit Bandyopadhyay, CTO of enterprise analytics & AI within Intel’s data platform group. “As more companies require data to be queried across many different data sources like Amazon S3, Amazon Redshift and Amazon RDS, Presto will become even more mission critical. Intel Optane coupled with the Ahana Cloud platform provides superior analytical performance and ease of use for Presto, enabling data-driven companies to query data in place for open data lake analytics.”

Availability and Pricing
Ahana Cloud for Presto is available in AWS Marketplace, with support for Microsoft Azure and Google Cloud to be added in the future. Ahana Cloud for Presto is elastically priced based on usage, with PAYGO and annual options via AWS Marketplace, starting from $0.25 per Ahana cloud credit. 

The Ahana/Intel Open Data Lakes Accelerator Package is available today via an AWS Marketplace Private Offer.

Supporting Resources

Tweet this: .@AhanaIO announces GA of #ManagedService for #Presto; Joins Intel Disruptor Program #cloudnative #opensource #analytics @prestodb https://bit.ly/2LiE6cL

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

How do I show tables from all schemas with Presto?

In Presto it is straightforward to show all tables in a schema. For example, if we have a MySQL data source/catalog with a “demo” schema, we can use show tables in mysql.demo; but this only reveals the tables in that one schema of the data source.

There is no equivalent way to show all tables across all schemas of a data source. However, there is the metastore to fall back on, which we can query: in MySQL, Glue, Hive and others there is a schema called “information_schema” which contains a table called “tables”. This maintains a list of the schemas and tables belonging to that data source.

For a MySQL data source:

presto> select table_schema, table_name from mysql.information_schema.tables order by 1,2;
    table_schema    |                  table_name                   
--------------------+-----------------------------------------------
 demo               | geography                                     
 demo               | state                                         
 demo               | test                                          
 information_schema | applicable_roles                              
 information_schema | columns                                       
 information_schema | enabled_roles                                 
 information_schema | roles                                         
 information_schema | schemata                                      
 information_schema | table_privileges                              
 information_schema | tables                                        
 information_schema | views                                         
 sys                | host_summary                                  
 sys                | host_summary_by_file_io                       
 sys                | host_summary_by_file_io_type                  
...

For Ahana’s integrated Hive metastore:

presto:demo> select table_schema, table_name from ahana_hive.information_schema.tables order by 1,2;
    table_schema    |        table_name         
--------------------+---------------------------
 csv_test           | yellow_taxi_trips         
 csv_test           | yellow_taxi_trips_orc     
 csv_test           | yellow_taxi_trips_staging 
 information_schema | applicable_roles          
 information_schema | columns
...    

This should help you show tables from all schemas.

How do I convert Bigint to Timestamp with Presto?

UNIX timestamps are normally stored as doubles. If you have UNIX timestamps stored as big integers then you may encounter errors when trying to cast them as timestamps:

presto> select col1 from table_a;
     col1      
------------
 1606485526 
 1606485575 
 
presto> select cast(col1 as timestamp) from table_a;
Query 20201127_150824_00052_xnxra failed: line 1:8: Cannot cast bigint to timestamp

There is a solution!  Presto’s from_unixtime() function takes a UNIX timestamp and returns a timestamp:

presto> select col1,from_unixtime(col1) as ts from table_a;
    col1    |          ts          
------------+-------------------------
 1606485526 | 2020-11-27 13:58:46.000 
 1606485575 | 2020-11-27 13:59:35.000 

And we can optionally modify the format of the result by using date_format():

presto> select date_format(from_unixtime(col1),'%Y-%m-%d %h:%i%p') from table_a;
        _col0        
---------------------
 2020-11-27 01:58PM 
 2020-11-27 01:59PM 

That’s how to use from_unixtime() to convert a bigint to timestamp. 

How do I convert timestamp to date with Presto?

Luckily Presto has a wide range of conversion functions and they are listed in the docs.  Many of these allow us to specifically convert a timestamp type to a date type.

To test this out we can use Presto’s built-in current_timestamp function (an alias for the now() function) that returns the current system time as a timestamp:

presto> select current_timestamp as "Date & Time Here Now";
         Date & Time Here Now          
---------------------------------------
 2020-11-27 13:20:04.093 Europe/London 
(1 row)

To grab the date part of a timestamp we can simply cast to a date:

presto> select cast(current_timestamp as date) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

Or we can use date() which is an alias for cast(x as date):

presto> select date(current_timestamp) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

We can use date_format() which is one of Presto’s MySQL-compatible functions: 

presto:demo> select date_format(current_timestamp, '%Y%m%d') as "Today's date";
 Today's date  
----------
 20201127 
(1 row)

Finally we can use format_datetime() which uses a format string compatible with JodaTime’s DateTimeFormat pattern format:

presto:demo> select format_datetime(current_timestamp, 'Y-M-d') as "Date";
  Date  
----------
 2020-11-27 
(1 row)

The above 5 examples should allow you to convert timestamps to dates in any scenario.

How do I configure Case Sensitive Search with Presto?

When dealing with character data, case sensitivity can be important when searching for specific matches or patterns. But not all databases and query engines behave the same way: some are case insensitive by default, some are not. How do we configure things so they behave the way we want?

Here’s an example of why we might need to take steps to control case sensitivity. We’re accessing a MySQL database directly:

mysql> select * from state where name='iowa';
+------+----+--------------+
| name | id | abbreviation |
+------+----+--------------+
| Iowa | 19 | IA           |
+------+----+--------------+
1 row in set (0.00 sec)

MySQL is case-insensitive by default. Even though the MySQL column contains the capitalized string ‘Iowa’ it still matched the query’s restriction of ‘iowa’.  This may be acceptable, but in some use cases it could lead to unexpected results.

Using Presto to access the same MySQL data source things behave differently, and arguably, in a more expected way:

presto:demo> select * from state where name='iowa';
 name | id | abbreviation 
------+----+--------------
(0 rows)
 
Query 20201120_151345_00001_wjx6r, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:07 [1 rows, 0B] [0 rows/s, 0B/s]
 
presto:demo> select * from state where name='Iowa';
 name | id | abbreviation 
------+----+--------------
 Iowa | 19 | IA           
(1 row)

Now we only get a match with ‘Iowa’, and not with ‘iowa’. Presto has essentially made this data source (MySQL) case sensitive, even though it is exactly the same database in both the above examples, with default configurations used.

Reconfigure Case Sensitivity 

With an RDBMS like MySQL you can configure the collation setting to control whether you want case sensitivity or not. You can set the collation at database or table creation as part of the CREATE statement, or you can use ALTER to change the collation of a database, table or individual column. This is described in MySQL’s documentation. 

But how do you change Presto to be case-insensitive? Presto does not support collation because it is not a database and doesn’t store data, and there is no configuration parameter that controls this. To manage case sensitivity in Presto, and mimic collation, we rewrite the query to force case insensitivity explicitly by using:

  • simple lower() or upper() functions

Example:

select * from state where lower(name)='california';
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

This query has matched any upper/lower case combination in the table, mimicking case insensitivity.

Or we can use regular expressions. 

Example:

select * from state where regexp_like(name, '(?i)california');
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

The regular expression syntax (?i) means matches are case insensitive. 

When it comes to Case Sensitive Search Configuration you are now an eXpErT.

When should I use ORC versus Parquet when using Presto?

If you’re working with open data lakes using open source and open formats, you can have multiple formats. Presto works with both. You’ll probably want to optimize for your workloads. 

Both ORC and Parquet store data in columns. Parquet is most efficient when it comes to storage and performance, while ORC is ideal for storing compact data and skipping over irrelevant data without complex or manually maintained indices. For example, ORC is typically better suited for dimension tables, which are somewhat smaller, while Parquet works better for fact tables, which are much bigger.
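Since the format is chosen per table, you can mix the two in the same data lake. Here’s a sketch using Presto’s Hive connector (the schema and table names are hypothetical):

-- A smaller dimension table stored as ORC
CREATE TABLE ahana_hive.sales.dim_store
WITH (format = 'ORC')
AS SELECT store_id, store_name, region FROM ahana_hive.staging.stores;

-- A large fact table stored as Parquet
CREATE TABLE ahana_hive.sales.fact_orders
WITH (format = 'PARQUET')
AS SELECT order_id, store_id, order_date, total FROM ahana_hive.staging.orders;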

What’s the advantage of having your own Hive metastore with Presto? How does it compare to Amazon Glue?

First let’s define what Apache Hive is versus Amazon Glue. Apache Hive reads, writes, and manages large datasets using SQL; it was built for Hadoop. AWS Glue is a fully managed ETL service for preparing and loading data for analytics. It automates ETL and handles the schemas and transformations. AWS Glue is serverless, so there’s no infrastructure to provision or manage (you only pay for the resources used while your jobs are running).

Presto isn’t a database and does not come with a catalog, so you’d want to use Hive to read/write/manage your datasets. Presto abstracts a catalog like Hive underneath it, and you can use the Glue catalog as the default Hive metastore for Presto.
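For reference, pointing Presto’s Hive connector at a Glue catalog is done in the catalog properties file. The sketch below reflects the connector’s Glue support, but treat the exact property names as something to verify against the PrestoDB Hive connector documentation for your version:

# etc/catalog/hive.properties (sketch)
connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=us-east-1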

With Ahana Cloud, you don’t really need to worry about integrating Hive and/or AWS Glue with Presto. Presto clusters created with Ahana come with a managed Hive metastore and a pre-integrated Amazon S3 data lake bucket. Ahana takes care of connecting external catalogs like Hive and Amazon Glue, so you can focus more on analytics and less on integrating your catalogs manually. You can also create managed tables as opposed to external tables.

How do you find out data type of value with Presto?

Presto has a typeof() function to make finding the data type of a value easy. This is particularly useful when you are getting values from nested maps for example and the data types need to be determined.

Here’s a simple example showing the type returned by Presto’s now() function:

presto:default> select now() as System_Timestamp, typeof( now() ) as "the datatype is";
           System_Timestamp            |     the datatype is      
---------------------------------------+--------------------------
 2020-11-18 15:15:09.872 Europe/London | timestamp with time zone 

Some more examples:

presto:default> select typeof( 'abc' ) as "the datatype is";
 the datatype is 
-----------------
 varchar(3)      
 
presto:default> select typeof( 42 ) as "the datatype is";
 the datatype is 
-----------------
 integer         
 
presto:default> select typeof( 9999999999 ) as "the datatype is";
 the datatype is 
-----------------
 bigint          
 
presto:default> select typeof( 3.14159 ) as "the datatype is";
 the datatype is 
-----------------
 decimal(6,5)

Armed with this info you should now be able to find out the data types of values.

How do you rotate rows to columns with Presto?

Sometimes called pivoting, here is one example of how to rotate row data with Presto.  

Suppose we have rows of data like this:

'a', 9
'b', 8
'a', 7 

We want to pivot this data so that all the ‘a’ row values are arranged in one column, and all the ‘b’ row values are in a second column like this:

 a | b 
---+---
 9 |   
   | 8 
 7 |   

To rotate from rows to columns we will add an id to make aggregation easy. We will name the output columns a and b, and we’ll include the id in our result set. This is how we do the rotation in Presto, using VALUES() to supply the test data, and simple conditional CASE WHEN END logic:

presto:default> SELECT id
, MAX(CASE WHEN key = 'a' THEN value END) AS a
, MAX(CASE WHEN key = 'b' THEN value END) AS b 
FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7 )) as test_data (id, key, value) 
GROUP BY id ORDER BY id;
 
 id |  a   |  b   
----+------+------
  1 |    9 | NULL 
  2 | NULL |    8 
  3 |    7 | NULL 
(3 rows)

There are other SQL options for transforming (pivoting) rows into columns – you can use the map_agg function, for example, as sketched below.
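Here is a sketch of the same pivot using map_agg; element_at() returns NULL when a key is absent from the map, so the result matches the CASE WHEN version above:

SELECT id
, element_at(kv, 'a') AS a
, element_at(kv, 'b') AS b
FROM (
    SELECT id, map_agg(key, value) AS kv
    FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7)) AS test_data (id, key, value)
    GROUP BY id
)
ORDER BY id;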

The code sample and description here should help when you need to rotate data from rows to columns using Presto.

How do you rotate columns to rows with Presto?

Sometimes called unpivoting, here is one example of how to rotate column data with Presto.  

Suppose we have some integer data arranged in two columns called a and b:

 a | b 
---+---
 9 |   
   | 8 
 7 |   

We want to rotate the columns into rows like this, where for every ‘a’ column value we now see a row labeled ‘a’, and ditto for the b values:

'a', 9
'b', 8
'a', 7

To rotate from columns to rows in Presto we’ll use a CTE and VALUES() to supply the test data, and simple conditional CASE WHEN END logic coupled with a sub-select and a UNION:

presto:default> with testdata(value_a, value_b) as ( VALUES (9,null), (null,8), (7,null) ) 
select key, value from 
(select 'a' as key, value_a as value 
from testdata 
UNION select 'b' as key, value_b as value 
from testdata) 
where value is not null;
 
 key | value 
-----+-------
 a   |     9 
 b   |     8 
 a   |     7 
(3 rows)

There are other SQL options for rotating (unpivoting) columns into rows: CROSS JOIN UNNEST, which is similar to Hive’s LATERAL VIEW explode function, can do it directly, as sketched below.
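Here is a sketch of the same unpivot using CROSS JOIN UNNEST, zipping a column of labels with the two value columns; it returns the same three rows as the UNION version above:

with testdata(value_a, value_b) as ( VALUES (9, null), (null, 8), (7, null) )
select t.key, t.value
from testdata
cross join unnest(ARRAY['a', 'b'], ARRAY[value_a, value_b]) as t (key, value)
where t.value is not null;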

The code sample and description here should help when you need to rotate data from columns to rows using Presto.

What are the operational benefits of using a managed service for Presto with Ahana Cloud?

First let’s hear from an AWS Solution Architect: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service, the customer of Ahana does not have to do a lot of work, everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service. The customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC. So you, as the customer, have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have that context, let’s get more specific. Say you want to create a cluster initially; it’s just a couple of clicks with Ahana Cloud. You can pick the coordinator instance type and the Hive metastore instance type, and it is all flexible. Instead of using the Ahana-provided Hive metastore, you can bring your own Amazon Glue catalog. Then of course it’s easy to add data sources. For that, you can add in JDBC endpoints for your databases. Ahana has those integrations built in, and then Ahana Cloud automatically restarts the cluster.

Compared to EMR or if you’re running with other distributions, all of this has to be done manually:

  • you have to create a catalog properties file for each data source
  • restart the cluster on your own
  • scale the cluster manually
  • add your own query logs and statistics
  • rebuild everything when you stop and restart clusters

With Ahana, all of this manual complexity is taken away. For scaling up, if you want to grow the analytics jobs over time, you can add nodes seamlessly. Ahana Cloud and other distributions can add the nodes to the cluster while your services are still up and running. But the part that isn’t seamless is when you stop the entire cluster. In addition to all the workers and the coordinator being provisioned, the configuration and the cluster connections to the data sources, and the Hive metastore are all maintained with Ahana Cloud. And so when you restart the cluster back up, all of that comes up pre-integrated with the click of a button: the nodes get provisioned again, and you have access to that same cluster to continue your analytics service. This is very important, because otherwise, you would have to manage it on your own, including the configuration management and reconfiguration of the catalog services. Specifically for EMR, for example, when you terminate a cluster, you lose track of that cluster altogether. You have to start from scratch and reintegrate the whole system.

How do you do a Lateral View Explode in Presto?

Hive’s explode() function takes an array (or a map) as input and outputs the elements of the array (map) as separate rows. Explode is a built-in table-generating function (UDTF) in Hive and can be used in a SELECT expression list and as part of a LATERAL VIEW.

The explode function doesn’t exist in Presto; instead we can use Presto’s similar UNNEST. 

Here’s an example using test results data in JSON form as input, from which we compute the average score per student. We use the WITH clause to define a common table expression (CTE) named example with a column aliased as data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

This code sample and description should help when you need to do a lateral view explode in Presto.

How do you cross join unnest a JSON array?

Here’s an example using test results data in JSON form as input, from which we compute the average score per student. We use the WITH clause to define a common table expression (CTE) named example with a column aliased as data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

The UNNEST approach is similar to Hive’s explode function.

This code sample and description should help when you need to execute a cross join to unnest a JSON array. 

How can you write the output of queries to S3 easily?

With Ahana Cloud, we’ve made it easy for you to write the output of queries to S3. While a variety of formats are supported, here’s an example:

presto> CREATE SCHEMA ahana_hive.s3_write WITH (location = 's3a://parquet-test-bucket/');
CREATE SCHEMA

presto> CREATE TABLE ahana_hive.s3_write.my_table
WITH (format = 'PARQUET')
AS SELECT <your query here> ;

Does Amazon Athena do joins across other data sources besides S3? Does Amazon Athena connect to other data sources?

With Amazon Athena you’re limited in scope when it comes to doing joins across other data sources like relational database systems and more. You have to set up a Lambda function, which then connects with your database on the back end – an additional piece you have to manage on your own.


With Ahana Cloud, you can directly connect to many different types of data sources including S3, MySQL, PostgreSQL, Redshift, Elastic, and more with a few clicks, and you’re ready to integrate with any cluster.

If I have catalogs connected and configurations attached to my Presto cluster, what happens when I take the cluster down?

If you’re managing Presto on your own, either through your own installation or through a service like AWS EMR or AWS Athena, you have to maintain and manage all of the catalogs and configurations attached to your cluster. That means that if you take your cluster down, you’ll lose those catalogs and configurations – they are not maintained for your cluster.


You can use the Ahana Cloud managed service for Presto to help with this. Ahana Cloud manages all of that for you, so you don’t have to worry about losing catalogs and configurations attached to your Presto cluster. You also get Presto bundled with data sources like the Hive metastore, Apache Superset, and more.

Optimize Presto EMR

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) simplifies running big data and analytics frameworks like Presto for scalable compute in the cloud. It provides on-demand, scalable Hadoop clusters for processing large data sets. You can move large volumes of data into and out of AWS datastores like S3 with Amazon EMR. AWS EMR uses Amazon EC2 instances for fast provisioning, scalability and high availability of compute power. 

With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premises Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.

As a Hadoop distribution, AWS EMR incorporates various Hadoop tools, including Presto, Spark and Hive, so that users can query and analyze their data. With AWS EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on each cluster instance for the lifetime of the cluster. In order to persist data stored in HDFS, it must be manually copied to S3 before the cluster is terminated.
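One common way to persist HDFS data before terminating a cluster is S3DistCp, which ships with EMR. A sketch, where the bucket and paths are placeholders:

$ s3-dist-cp --src hdfs:///user/hive/warehouse/my_table \
             --dest s3://my-bucket/warehouse/my_table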

What is Presto?

Presto is an open source, federated SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. It runs on a cluster of machines and enables interactive, ad hoc analytics on large amounts of data. 

Presto enables querying data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, or even proprietary data stores. Presto allows users to access data from multiple sources, allowing for analytics across an entire organization.

Using Presto on Amazon EMR

With Presto and AWS EMR, users can run interactive queries on large data sets with minimal setup time. AWS EMR handles the provisioning, configuration and tuning of Hadoop clusters. Providing you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included automatically as part of the cluster software. Earlier versions of AWS EMR include Presto as a sandbox application.

AWS EMR And Presto Configurations

As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In AWS EMR, data is either copied to HDFS on each cluster instance or read from S3. With EMR 5.12.0 onwards, by default Presto uses EMRFS to connect to Amazon S3. EMRFS extends the HDFS API to S3, giving Hadoop applications, like Presto, access to data stored in S3 without additional configuration or copying of the data. For earlier versions of AWS EMR, data in S3 can be accessed using Presto’s Hive connector.
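Once a cluster is up, you can query S3-backed Hive tables from the master node with the bundled CLI. A sketch, where the table name is hypothetical:

$ presto-cli --catalog hive --schema default \
    --execute "SELECT count(*) FROM my_s3_table"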

Real world applications

Jampp is a mobile app marketing platform that uses programmatic ads to acquire new users and retarget them with relevant ads. It sits between advertisers and their audiences, so real-time bidding for media advertising space is critical to their business. The amount of data Jampp generates as part of the bidding cycle is massive – 1.7B events are tracked per day, 550K requests per second are received, and 100TB of data is processed by AWS elastic load balancers per day. PrestoDB plays a critical role in their data infrastructure: Jampp relies on Presto on AWS EMR for their ad hoc queries and performs over 3K ad hoc queries per day on over 600TB of queryable data.

Ahana Welcomes Database Pioneer David E. Simmen to Executive Team as Chief Technology Officer

Former Splunk Fellow and Chief Architect & Teradata Fellow brings 30+ years of database expertise to lead technology innovation

San Mateo, CA — July 23, 2020 — Ahana, the SQL analytics company for Presto, today announced the appointment of David E. Simmen as Chief Technology Officer. Simmen will oversee the company’s technology strategy while driving product innovation. His extensive database experience will further accelerate Ahana’s vision of simplifying ad hoc analytics for organizations of all shapes and sizes.

“Dave has a remarkable record of innovation in distributed databases, data federation and advanced analytics. That, coupled with his over 30 years of architecting and building in the field, is nothing short of impressive,” said Steven Mih, Co-founder & CEO of Ahana. “His experience innovating distributed database technology with Splunk, Teradata and IBM provides Ahana the depth needed for truly groundbreaking development to achieve our vision of unified, federated analytics. I couldn’t be more thrilled to have Dave at the helm as CTO and Co-founder.”

Simmen joined Ahana most recently from Apple where he engineered iCloud database services. Prior to Apple, he was Chief Architect with Splunk and named the first Fellow in the company’s history. There, he built Splunk’s multi-year technical roadmap and architecture in support of enterprise customers. Prior to Splunk, Simmen was Engineering Fellow and CTO of Teradata Aster where he set the technical direction and led the architecture team that built advanced analytics engines and storage capabilities in support of big data discovery applications.

Earlier in his career, Simmen was a Senior Technical Staff Member (STSM) at IBM Research adding innovations to DB2 like the IBM Starburst SQL compiler, support for MPP and SMP, database federation, and many more. Simmen is also a named inventor on 37 U.S. patents and has 15 publications to his name including Fundamental Techniques for Order Optimization, Robust Query Processing through Progressive Optimization, and Accelerating Big Data Analytics With Collaborative Planning.

“Given the spread of data across data lakes, data warehouses, and the plethora of relational and non-relational data services being offered today, the need for unified, federated analytics is only increasing. Presto is one of the most popular federated query engines but it’s missing fundamental optimizations developed over the decades by academic researchers and industry pioneers,” said Simmen. “That needs to be applied to Presto, making it even more powerful to eventually become the heart of the unified, federated analytical stack. I look forward to working with the Ahana team to turn this vision into reality.”

“In my time with Dave at IBM’s Almaden Research Center and then with the DB2 optimizer team at Silicon Valley Laboratory, I saw firsthand his technical expertise and his broader leadership skills,” said Dr. Laura Haas, Dean of College of Information and Computer Sciences at UMass Amherst and retired IBM Fellow. Dean Haas is best known for her work on the Starburst query processor and Garlic, a federated engine. 

Dean Haas continued, “His combined leadership and technical depth is unparalleled, and it’s wonderful that he has joined Ahana to drive innovation for the next generation of data federation technology for analytics. I look forward to seeing his vision come to life in this space.”

Resources:

Download a headshot of David E. Simmen.

Links to David Simmen’s published research papers 

Tweet this:  .@AhanaIO welcomes Database Pioneer David E. Simmen to Executive Team as #CTO #database #opensource #analytics #cloud bit.ly/30xmwWr

About Ahana

Ahana, the SQL analytics company for Presto, is focused on evangelizing the Presto community and bringing simplified ad hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to enable unified, federated analytics. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn and Twitter.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana Announces Linux Foundation’s PrestoDB Now Available on AWS Marketplace and DockerHub

Ahana to provide community and commercial support for PrestoDB users 

San Mateo, CA — June 30, 2020 — Ahana, the Presto-based analytics company, announced today the availability of the completely open source PrestoDB Amazon Machine Image (AMI) on the AWS Marketplace as well as a PrestoDB container on DockerHub. The PrestoDB AMI is the first and only completely open source and completely free Presto offering available on the AWS Marketplace. These free PrestoDB offerings make it significantly easier for data platform teams to get started with Presto in the cloud particularly for interactive, ad hoc analytics on S3 data lakes and other popular data sources like AWS RDS, Redshift, Amazon’s Elasticsearch service and others. 

Additionally, Ahana will provide commercial support for users who require technical help and management of their container or AMI-based PrestoDB clusters.

PrestoDB is a federated SQL engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. As a result, data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place, without the need to move data. One of the fastest growing projects in the data analytics space, PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter. 

“We’re looking forward to making it easier for the Presto community to get started with Presto,” said Dipti Borkar, co-founder and Chief Product Officer at Ahana and Outreach Committee Chairwoman of the Presto Foundation. “This is the first and only fully open source version of Presto that’s available on the AWS Marketplace, and we hope to see continued growth and adoption of PrestoDB as the federated query engine of choice for analytics across data lakes and database.”

In addition to the PrestoDB AMI and container, new Presto Sandbox offerings are also available to users getting started with Presto. These include a new Sandbox AMI on the AWS Marketplace and a Sandbox container on DockerHub. They come preconfigured with the Hive Metastore catalog to allow users to query data easily from S3, MySQL and other data sources as well as with built-in data sets like the TPC-DS benchmark data.

The new AWS AMIs and container offerings are yet another way for PrestoDB users to install and get started with the software. Other options include manual installation and using Homebrew. All variations offer PrestoDB as a completely open source technology.

“Having tracked the data space for more than two decades, I’ve seen wave after wave of innovation. What Presto brings to the table is something we’ve long sought: an efficient engine for federated queries across widely disparate datasets,” noted Eric Kavanagh, CEO of The Bloor Group. “This allows organizations to retain their investments in databases, data warehouses, data lakes, and so-called lake houses, while expediting and amplifying business value. The approach Ahana is taking around making PrestoDB easier to use and providing commercial support are key pieces in helping widen the adoption of the technology.” 

Two Presto offerings, one name

Despite similar names, PrestoDB and PrestoSQL are two different offerings. While other variations of Presto are available on the marketplace like the Starburst Data AMI (based on PrestoSQL), they are paid offerings with proprietary features. The PrestoDB AMI, on the other hand, is 100% open source and available for use in production immediately. 

As the original project that came out of Facebook in 2013, PrestoDB is hosted under the auspices of the Linux Foundation’s Presto Foundation. The PrestoSQL fork was announced in 2019 and is backed by the Presto Software Foundation, led by the original creators of Presto who left Facebook.

Resources

Tweet this:  .@AhanaIO announces @prestodb now available on @awsmarketplace; Ahana will provide support #opensource #analytics #cloud https://bit.ly/2BGPudo

About Ahana

Ahana, the Presto-based analytics company, is the only company focused on unifying the PrestoDB community and bringing PrestoDB-focused ad-hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

bethwinkowski@gmail.com