On-Demand Presentation

As more companies augment their traditional cloud data warehouses, and in some cases replace them with cloud data lakes, a new stack has emerged. It supports data warehouse workloads that weren’t previously possible on a data lake, while bringing additional advantages: lower cost, flexibility, and no lock-in thanks to open formats and open interfaces. 

The new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack

Unlike the cloud data warehouse, which is closed source, stores data in proprietary formats, tends to be very expensive, and assumes data must be ingested and integrated into one database to provide the critical business insights for decision-making, the PHAS3 stack is open, flexible, and affordable.

In this roundtable discussion, experts from each layer in this stack – Presto, AWS, and Apache Hudi – discuss why we’re seeing such pronounced adoption of this next generation of cloud data lake analytics and how these technologies enable open, flexible, and highly performant analytics in the cloud.

Speakers

Vinoth Chandar
Creator of Hudi

Roy Hasson
Principal Product Manager

Dipti Borkar
Cofounder & CPO

Eric Kavanagh
Moderator

How do I query a data lake with Presto?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Structured and semi-structured data can be queried by Presto, an open source SQL engine. This allows users to store data as-is, without having to first structure the data, and run different types of analytics. 

To query data in a data lake, the following technologies are needed:

  1. A SQL query engine – Presto was built for querying data lakes, originally HDFS and now increasingly cloud object stores such as AWS S3 and Google Cloud Storage, among others. 
  2. A big data catalog – there are two popular big data catalog systems, also called metastores: the Hive Metastore and the AWS Glue service. 
  3. Buckets in the data lake, such as AWS S3.

What types of data can be queried by Presto? 

The following file formats can be queried by Presto:

  1. ORC
  2. Parquet
  3. Avro
  4. RCFile
  5. SequenceFile
  6. JSON
  7. Text
  8. CSV

How does it work? 

First, data in the data lake needs to be mapped into tables and columns. This is what the Hive Metastore and AWS Glue catalogs help with. For example, if there is a CSV file, then once Presto is integrated with the Hive Metastore or Glue, users can use Presto commands to create a schema (if needed) and then run a CREATE TABLE statement that maps the file to a table and its columns. 

Example:

SHOW CATALOGS; 
USE ahana_hive.default; 

CREATE TABLE user ( 
	-- the Hive connector's CSV format only supports varchar columns;
	-- cast to other types at query time, or convert the table to
	-- Parquet/ORC as described below
	registration_dttm 	varchar, 
	id 					varchar,
	first_name 			varchar,
	last_name 			varchar,
	email 				varchar,
	gender 				varchar,
	ip_address 			varchar,
	cc 					varchar,
	country 			varchar,
	birthdate 			varchar,
	salary 				varchar,
	title 				varchar,
	comments 			varchar
) WITH ( 
format = 'CSV', 
skip_header_line_count = 1,
external_location = 's3a://ahana/userdata/' );

Once the table is created and mapped to the external location, it can immediately be queried. 

Example: 

SELECT * FROM user; 

You can run these commands in Presto using the Presto-cli. More information in the docs.

If you are looking for better performance, it is recommended to convert formats like JSON and CSV into more optimized formats like Apache Parquet and Apache ORC. This will greatly improve query performance. The conversion can also be done with Presto using the CREATE TABLE AS command, as sketched below; more information is available in the docs.
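As a minimal sketch (the target table name is illustrative, only a subset of columns is shown, and the casts assume the varchar columns of the CSV-backed user table above):

CREATE TABLE user_parquet
COMMENT 'Parquet copy of the CSV-backed user table'
WITH (format = 'PARQUET')
AS
SELECT
	CAST(registration_dttm AS timestamp) AS registration_dttm,
	CAST(id AS integer) AS id,
	first_name,
	last_name,
	email,
	CAST(salary AS double) AS salary
FROM user;

The new table is a managed table (CREATE TABLE AS cannot be used with an external_location), so the converted files are written to the schema’s location and subsequent queries read the faster columnar format.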

Ahana Cloud makes it very easy to query a data lake with Presto. It is a managed service for Presto and also comes with a built-in Hive Metastore so that you don’t need to deploy and manage one. In addition, it can also integrate with AWS Glue. Getting started with Ahana Cloud is easy. Here’s how: https://ahana.io/docs/getting-started

Additional resources: 

Presto Docker Container 
Presto-cli Docker Container 
Presto Sandbox Docker Container 

Why am I getting a Presto EMR S3 timeout error?

If you’re using AWS EMR Presto, you can use the S3 select pushdown feature to push down compute operations (i.e. SELECT) and predicate operations (i.e. WHERE) to S3. Pushdown makes query performance much faster because it means queries will only retrieve required data from S3. It also helps in reducing the amount of data transferred between EMR Presto and S3.

If you’re using pushdown for EMR Presto and seeing a timeout error, there might be a few reasons for that. Because Presto uses EMRFS as its file system, there’s a maximum allowable number of client connections to S3 through EMRFS for Presto (500). When using S3 Select Pushdown, you bypass EMRFS when you access S3 for predicate operations so the value of hive.s3select-pushdown.max-connections is what will determine the max number of client connections allowed by worker nodes. Requests that aren’t pushed down use the value of fs.s3.maxConnections.

At this point you might get an error that says “timeout waiting for connection from pool”. That usually means you need to increase both of the values above. Once you do that, the timeouts should be resolved.
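As a hedged sketch, the two settings might look like this (the value 2000 is purely illustrative, and the EMR configuration classification names in the comments are assumptions based on typical EMR setups):

# Hive connector properties for Presto (e.g. the presto-connector-hive classification)
hive.s3select-pushdown.max-connections=2000

# EMRFS setting for requests that are not pushed down (e.g. the emrfs-site classification)
fs.s3.maxConnections=2000

Tune both values to your level of query concurrency and retest; the right numbers depend on cluster size and workload.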

Errors like these are common with Presto on EMR. EMR is complex and resource-intensive, and there’s a lot you have to understand when it comes to the specific configuration and tuning parameters for Hadoop. Many companies have switched from EMR Presto to Ahana Cloud, a managed service for Presto on AWS that is much easier to use. Ahana Cloud is a non-Hadoop deployment in a fully managed environment. Users see up to 23x better query performance with Ahana’s built-in caching.

Check out some of the differences between Presto EMR and Ahana Cloud. If you’re using EMR Presto today, Ahana Cloud might help with some of those pain points. Additionally, Ahana is pay-as-you-go pricing and it’s easy to get started if you’re already an EMR user. 

Ahana Demonstrates Major Momentum in Customer and Community Adoption for Presto 1H 2021

The Presto company also shows significant product momentum with numerous accolades and industry recognition 

San Mateo, Calif. – June 24, 2021 — Ahana, the Presto company, today announced major momentum in customer and community adoption for the first half of the year. Ahana Cloud for Presto has seen strong adoption across many verticals in the mid-size and enterprise markets for its easy-to-use, high-performance cloud managed service to query AWS S3 data lakes. 

“With the rapid growth of data lakes today, companies are turning to SQL query engines like Presto to get fast insights directly on their data lakes and with other data sources,” said Steven Mih, CEO of Ahana. “Presto is increasingly becoming the de facto choice for SQL queries on the data lake because of its performance and open, flexible architecture. But for most companies, leveraging Presto can be complex and resource-intensive, and that’s where Ahana can help in making it incredibly easy to get the power of Presto for your AWS S3-based open data lake analytics. The momentum we’ve seen in the Presto community coupled with Ahana customer adoption and industry accolades is a testament to how critical Presto is to unlock data insights on the now ubiquitous data lake.”

Continuing Customer Success and New Customer Wins

Ahana has continued to grow its customer base across all verticals in the mid-size and enterprise markets, including companies in the telco, FinServ, AdTech, and security industries, and today there are dozens of companies using Ahana Cloud for Presto on AWS. Recent notable additions and success stories include Securonix, Dialog, Carbon, Rev, Metropolis, Requis, and Cartona.

Earlier this year, ad tech company Carbon shared at PrestoCon why they chose Ahana Cloud for Presto to power their customer-facing dashboards and eCommerce company Cartona presented their Ahana Cloud for Presto use case. Securonix, a leading security operations and analytics company, is one of the latest companies to deploy Ahana Cloud for Presto.

Sachin Nayyer, CEO at Securonix, said at the AWS Startup Showcase featuring Ahana, “We are very excited about our partnership with Presto and Ahana because Ahana provides us the ability to cloudify Presto, in addition to being our conduit to the Presto community. We believe this is the analytics solution of the future, and with Ahana for Presto we’re able to offer our customers data that’s queryable at an extremely fast speed at very reasonable price points. That has significant benefits for our customers.”

Open Source Presto Community Momentum

The Presto open source community has also continued to grow exponentially over the course of the year. March’s PrestoCon Day was the largest Presto event to date and featured sessions from Facebook on the Presto roadmap and Twitter on the RaptorX project, plus panel discussions on the Presto ecosystem and Presto, Today and Beyond.

At Percona Live Online, the biggest open source database conference in the world, the Presto community track had hundreds of attendees and featured sessions like the Kubernetes operator for Presto, how Facebook’s usage of Presto drives innovation, and many more from presenters at Facebook, Twitter, AWS and more.

Additionally, the Docker Sandbox Container for Presto hosted by Ahana has seen hundreds of thousands of pulls over the course of the year, demonstrating significant growth in Presto usage.

Global Presto meetups have grown in size to over 1,000 members across the globe in cities like New York City, London, Bangalore, Sydney, and more, and Presto Foundation membership has grown to ten companies with new members Intel and Hewlett Packard Enterprise.

“As a member of the Presto Foundation, Intel is committed to working with the Presto open source project along with Ahana, Facebook, Uber and others to drive even more innovation and community engagement,” said Arijit Bandyopadhyay, CTO – Enterprise Analytics & AI, Head of Strategy – Enterprise & Cloud, Data Platforms Group at Intel Corporation. “We look forward to continuing to build on the fantastic momentum thus far and helping even more developers and enterprises get up and running with Presto.”

“The engagement we’re seeing within the community at meetups and events like PrestoCon Day and Percona Live coupled with the usage of the Docker container are indicative of how much the Presto project continues to grow,” said Dipti Borkar, Chairperson of the Presto Foundation Outreach Committee and Cofounder and Chief Product Officer, Ahana. “As Presto continues to be the de facto query engine for the data lake, we look forward to continuing to build a robust and vibrant community around the project and expand the use cases it supports.”

Product Accolades and Industry Recognition 

Ahana has also received numerous editorial and industry awards in 2021.

Tweet this: @AhanaIO announces major customer and community adoption for 1H 2021 #presto #cloud #AWS #datalake https://bit.ly/2TXBUeW

About Ahana

Ahana, the Presto company, offers the only managed service for Presto on AWS with the vision to simplify open data lake analytics. Presto, the open source project created by Facebook and used at Uber, Twitter and thousands more, is the de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the easiest Presto SaaS and enables data platform teams to provide high performance SQL analytics on their S3 data lakes and other data sources. As a leading member of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also focused on fostering growth and evangelizing open source Presto. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Streaming Data Processing Using Apache Kafka and Presto

Kafka Quick Start

Kafka is a distributed data streaming framework meant to enable the creation of highly scalable distributed systems. Developed at LinkedIn in 2008 and open-sourced in 2011, it was designed to support decoupled yet conceptually connected systems. Broken down to the simplest level, Kafka provides a consistent, fast, and highly scalable log. Specifically, it is a commit log in which all writes are guaranteed to be ordered and entries cannot be deleted or modified.

Once entries are added to the log, different systems can then process the entries, communicating with each other as needed, most likely by adding entries to the Kafka commit log. This enables the creation of software as a system of systems. Communication and processing happen in parallel and asynchronously, enabling each system to be developed, maintained, scaled, and enhanced as needed. Some of the companies using Kafka include Coursera, Netflix, Spotify, Activision, Uber, and Slack. 

Inner Workings Of Kafka

Kafka consists of producers that send messages to a Kafka node. These messages are grouped by topics to which consumers are subscribed. Each consumer receives all the messages sent to the topics it is subscribed to and carries out further processing as required. All the messages sent to the broker are stored for a given time or until they reach a given size on disk. Deployment is done in a cluster consisting of several brokers to ensure there is no single point of failure.

Messages sent to topics are split into partitions that are replicated across several nodes. The replication factor is determined by the performance and resilience requirements of the data/system. At any moment, one Kafka broker acts as the partition leader that owns the partition. It is the node to which producers write their messages and from which consumers read them.

What is Presto?

Presto is a distributed query engine that allows the use of ANSI SQL to query data from multiple data sources. It holds processing and query results in memory, making it extremely efficient and fast. A Presto cluster consists of a coordinator node and multiple worker nodes. The worker nodes are responsible for connecting to data stores via plugins/connectors and for query processing.

Distributed Data Processing Using Kafka and Presto

Kafka and Presto are normally combined with Kafka providing real-time data pipelines and Presto provisioning distributed querying. This is easily achieved using Presto’s Kafka connector that provides access to Kafka topics. It is also possible to have Presto as the producer sending messages to Kafka which are processed by other applications like business intelligence (BI) and machine learning (ML) systems.

To connect Presto and Kafka, you need to have the Kafka cluster running. You then add a catalog file with the connector.name value set to kafka, the kafka.table-names property listing the topics from the cluster, and the kafka.nodes property containing the broker node(s). If multiple Kafka clusters are available, each is connected to Presto by adding a uniquely named catalog file per cluster.
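For reference, a minimal catalog file sketch (the broker hostnames and topic names are illustrative):

# etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=kafka-broker-1:9092,kafka-broker-2:9092
kafka.table-names=clickstream,orders

Each additional Kafka cluster gets its own uniquely named catalog file, for example etc/catalog/kafka_analytics.properties.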

Get Started with Presto & Apache Kafka

Business Intelligence And Data Analysis With Druid and Presto

Apache Druid Helicopter View

Apache Druid is a distributed, columnar database aimed at developing analytical solutions. It offers a real-time analytics database able to ingest and query massive amounts of data quickly and store the data safely. It was developed by Metamarkets in 2011, open-sourced in 2012, and made an Apache project in 2015. Some of the companies using Druid include Paypal, Cisco, British Telecom (BT), Reddit, Salesforce, Splunk, Unity, and Verizon.

Druid incorporates ideas from data warehousing, cloud computing, and distributed systems. Its architecture provides many characteristics and features that make it a top candidate for an enterprise, real-time data analysis datastore. Druid runs on a cluster of multiple nodes, offering high scalability, concurrency, availability, and fault tolerance.

The Apache Druid Cluster Architecture

Druid nodes are of various types, each serving a specialized function. Real-time nodes read and index streaming data, creating segments that are stored until they are forwarded to historical nodes. Historical nodes store and read immutable data segments in deep storage like S3 and HDFS. Coordinator nodes handle data management, including segment-to-historical-node assignment, load balancing, and replication.

Overlord nodes handle a Druid cluster’s task and data ingestion management. They assign tasks to Middle Manager nodes, which process the tasks and provide features like real-time indexing. Broker nodes provide an interface between the cluster and clients: they accept queries, send them to the appropriate real-time/historical nodes, accept the query results, and return the final results to the client. Druid also has optional Router nodes that proxy requests to the Overlord and Coordinator nodes and route queries to Broker nodes.

What is Presto?

Presto is an open source SQL query engine built for data lake analytics and ad hoc queries. It was developed to meet Facebook’s OLAP needs against their Hive data lake. Its design goals include fast and parallel query processing, creating a virtual data warehouse from disparate datastores via a plugin architecture, and being a highly scalable and distributed query engine. 

Presto is deployed in production as a cluster of nodes to improve performance and scalability.

Druid and Presto Data Analysis Application

Druid and Presto are usually combined to create highly scalable, parallel, distributed real-time analytics, business intelligence (BI), and online analytical processing (OLAP) solutions. Since both platforms are open source, users can enjoy their power without purchase or licensing costs if they’re comfortable managing both on their own. Having Druid process real-time data and handle ad-hoc querying enables real-time analytics on a Presto-powered stack. Presto allows users to perform join queries across disparate data sources, so they can select the datastore that best meets each of their needs, e.g. online transaction processing (OLTP) databases like MySQL, document-oriented databases like MongoDB, and/or geospatial databases like PostGIS.

Integrating Druid with Presto is done via the Presto Druid connector. This requires the creation of a catalog properties file that configures the connection. The first property is connector.name, which needs to be set to druid. The druid.broker-url and druid.coordinator-url properties accept the URLs of the broker and coordinator, respectively, in hostname:port format. Query pushdown is enabled by setting druid.compute-pushdown-enabled to true.
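A minimal catalog file sketch (hostnames and ports are illustrative):

# etc/catalog/druid.properties
connector.name=druid
druid.broker-url=druid-broker:8082
druid.coordinator-url=druid-coordinator:8081
druid.compute-pushdown-enabled=true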

Get Started with Presto & Druid

Flexible And Low Latency OLAP Using Apache Pinot and Presto for real time analytics

Apache Pinot Overview

Apache Pinot is a distributed, low latency online analytical processing (OLAP) platform used for carrying out fast big data analytics. Developed at LinkedIn in 2014, the highly scalable platform is meant to power time-sensitive analytics and is designed to have low latency and high throughput. It was open-sourced in 2015 and incubated by the Apache Software Foundation in 2018. Some of its use cases include high dimensional data analysis, business intelligence (BI), and providing users with profile view metrics. Other companies using Pinot include Uber, Microsoft, Target, Stripe, and Walmart.

Simplified View Of How Apache Pinot Works

Pinot is meant to be highly scalable and distributed while providing high throughput and fast turnaround time. To achieve this, related data from streaming sources like Kafka and data lakes like S3 are stored in tables. The tables are split into segments that are sets containing non-changing tuples. Segments are stored in a columnar manner and additionally contain metadata, zone maps, and indices related to contained tuples. Segments are stored and replicated among Pinot server nodes. Controller nodes contain global metadata related to all segments in a cluster like server node to segment mapping.

Pinot consists of four main components namely brokers, servers, minions, and controllers. The controller handles cluster management, scheduling, resource allocation, and a REST API for administration. The Pinot broker is responsible for receiving client queries, sending them to servers for execution, and returning the results of the queries to the client. Servers have segments that store data and handle most of the distributed processing. They are divided into offline and real-time servers, with offline servers typically containing immutable segments and real-time servers that ingest data from streaming sources. Minions are used for maintenance tasks not related to query processing like periodically purging data from a Pinot cluster for security and regulatory compliance reasons.

What is Presto?

Presto is a fast query engine able to handle processing in a parallel and distributed manner. It’s an open source, distributed SQL query engine.

Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Fast and Flexible OLAP With Pinot and Presto

When carrying out analytics, system designers and developers normally have to make a tradeoff between querying flexibility and fast response times. The more flexible a system is, the slower its response time. Pinot is extremely fast but has limited flexibility while Presto is a bit slower but offers more flexibility. Having a Pinot cluster as the storage layer and a Presto cluster as the querying layer provides users with high throughput, low latency storage and powerful, flexible querying. Integration is achieved using an open source Presto Pinot connector that is responsible for managing connections and mapping queries and their results between the two platforms. Optimization is achieved by query pushdown to Pinot with Presto offering features lacking in Pinot like table joins.
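As a minimal sketch, the integration is configured through a catalog file for the Pinot connector (the controller hostname and port are illustrative, and the property name assumes the connector’s pinot.controller-urls setting):

# etc/catalog/pinot.properties
connector.name=pinot
pinot.controller-urls=pinot-controller:9000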

You can learn more about the Apache Pinot connector for Presto in the PrestoCon session presented by the Apache Pinot team.

Get Started with Apache Pinot & Presto

Turbocharge your Analytics with MongoDB And Presto

High-Level View Of MongoDB

MongoDB is a NoSQL distributed document database meant to handle diverse data management requirements. Its design goals include being an object-oriented, highly available, scalable, and efficient database with ACID (Atomicity, Consistency, Isolation, and Durability) guarantees. Its document model enables data to be stored in its most natural form as opposed to the relational model, making users more productive. It supports both schemaless and schema-based designs, offering flexibility as well as data integrity and consistency enforcement as needed. Some of the organizations using MongoDB include Google, SAP, Verizon, Intuit, Sega, Adobe, InVision, and EA Sports.

A Look At MongoDB Architecture

MongoDB stores data in documents in the Binary JSON (BSON) format. Logically related documents are grouped into collections, which are indexed. The MongoDB servers that store data form shards, which are grouped into replica sets. Replica sets have the same data replicated among them, with the default replication factor being 3 servers. Data is partitioned into chunks, which, combined with sharding and replication, provides high reliability and availability. During partitioning, consistency is ensured by making the affected data temporarily unavailable for writes. Config servers hold configuration data and metadata related to the MongoDB clusters. MongoDB’s routers accept queries, return results to clients, and are responsible for directing queries to the correct shards. 

MongoDB Deployment

MongoDB is cross-platform and can be installed on all major operating systems. It can either be installed manually, deployed on private and/or public clouds, or accessed via premium cloud offerings. Recommended practice in production is to have multiple nodes running MongoDB instances, forming a cluster.

What is Presto?

Presto is an open source SQL query engine that provides a scalable and high throughput query engine capable of accessing different data stores including MySQL, DB2, Oracle, Cassandra, Redis, S3, and MongoDB. This enables the creation of a virtualized data lake of all data. Combining Presto with MongoDB creates a highly scalable and cohesive yet loosely decoupled data management stack.

Scalable Analytics With MongoDB and Presto

Combining MongoDB and Presto provides a highly scalable tech stack for developing distributed analytical applications. MongoDB is an enterprise distributed database capable of storing data as strictly as users need it to be while ensuring high horizontal scalability, availability, resilience, and self-healing. Designers and developers can choose the data model that best serves them, trading flexibility for strictness in the schema design and performance for transactional integrity in write operations. Different clusters can be created as needed to meet different performance and functional goals.

For example, writes can be unacknowledged, acknowledged, or replica-acknowledged, with faster writes being achieved with weaker write enforcement. Reads can be performed from secondary, primary-preferred, and primary nodes for a tradeoff between turnaround times and fetching stale data. This makes it a great storage layer for OLAP systems. Data can be persisted as accurately or read as fast as possible as per each application’s need. Integration is achieved using the Presto MongoDB connector.
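A minimal catalog file sketch for that connector (the seed hostnames are illustrative):

# etc/catalog/mongodb.properties
connector.name=mongodb
mongodb.seeds=mongo-1:27017,mongo-2:27017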

Can you insert a JSON document into MongoDB with Presto?

This question comes up quite a bit. In short, yes, you can. You’d run an INSERT statement from Presto, and because Presto exposes the MongoDB collection as a table, you insert into it as a table. For example:

INSERT INTO orders VALUES(1, 'bad', 50.0, current_date);

That insert would go into MongoDB as a JSON document.

Getting started with Presto in the cloud

If you want to get started with Presto quickly in the cloud, try out Ahana Cloud for free. Ahana takes care of the deployment, management, adding/detaching data sources, etc. for you. It’s a managed service for Presto that makes it really easy to get started. You can try it free at https://ahana.io/sign-up 

CRN® Recognizes Ahana on Its 2021 Big Data 100 List As One of The Coolest Business Analytics Companies

Ahana also named to CRN’s 10 Hot Big Data Companies You Should Watch in 2021 list

San Mateo, Calif. – May 5, 2021 — Ahana, the self-service analytics company for Presto, announced today that CRN®, a brand of The Channel Company®, recognized Ahana on its 2021 Big Data 100 list as one of the Coolest Business Analytics Companies. This annual list recognizes the technology vendors that go above and beyond by delivering innovation-driven products and services for solution providers that in turn help enterprise organizations better manage and utilize the massive amounts of business data they generate.

This recognition follows Ahana’s recent distinction by CRN as one of 10 Hot Big Data Companies You Should Watch in 2021. “We are honored to receive these prestigious accolades from one of the industry’s most influential media sources,” said Steven Mih, Cofounder and CEO, Ahana. “This is another validation of tremendous growth in users of the open source Presto project and the innovation of Ahana Cloud for Presto, which brings the power of the most powerful open source distributed SQL query engine to any organization.”

Ahana Cloud for Presto is the first and only cloud-native managed service for Presto on Amazon Web Services (AWS), giving customers complete control and visibility of clusters and their data. Presto is an open source distributed SQL query engine for data analytics. With Ahana Cloud, the power of Presto is now accessible to any data team of any size and skill level.

A team of CRN editors compiled this year’s Big Data 100 list by identifying IT vendors that have consistently made technical innovation a top priority through their offering of products and services for business analytics, systems and platforms, big data management and integration tools, database systems, and data science and machine learning. Over the years, the Big Data 100 list has become an invaluable resource for solution providers that trust CRN to help them find vendors that specialize in data intelligence, insights, and analytics.

“IT vendors featured on CRN’s 2021 Big Data 100 list have demonstrated a proven ability to bring much-needed innovation, insight and industry expertise to the solution providers and customers that need it most,” said Blaine Raddon, CEO of The Channel Company. “I am honored to recognize these companies for their unceasing commitment toward elevating and improving the ways businesses gain value from their data.”

The 2021 Big Data 100 list is available online at https://www.crn.com/news/cloud/the-big-data-100-2021

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

About The Channel Company®

The Channel Company enables breakthrough IT channel performance with our dominant media, engaging events, expert consulting and education, and innovative marketing services and platforms. As the channel catalyst, we connect and empower technology suppliers, solution providers, and end users. Backed by more than 30 years of unequaled channel experience, we draw from our deep knowledge to envision innovative new solutions for ever-evolving challenges in the technology marketplace. www.thechannelcompany.com

Follow The Channel Company®: Twitter, LinkedIn, and Facebook

© 2021 The Channel Company, LLC. CRN is a registered trademark of The Channel Company, LLC. All rights reserved.

The Channel Company Contact:

Jennifer Hogan

The Channel Company

jhogan@thechannelcompany.com

Ahana Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

How do I sync my partition and metastore in Presto?

The sync_partition_metadata procedure is used to sync the metastore with the partition information on the file system/S3 for an external table. Depending upon the number of partitions, the sync can take time.

Here is a quick reference from the presto docs: https://prestodb.io/docs/current/connector/hive.html?highlight=sync_partition_metadata

Procedures

  • system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)
    Create an empty partition in the specified table.
  • system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)
    Check and update partitions list in metastore. There are three modes available:
    • ADD : add any partitions that exist on the file system but not in the metastore.
    • DROP: drop any partitions that exist in the metastore but not on the file system.
    • FULL: perform both ADD and DROP.

The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (e.g. col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
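For example, a quick sketch (the schema and table names are illustrative, and the session catalog is assumed to be the Hive catalog, e.g. after USE ahana_hive.default):

-- add partitions found on S3 but missing from the metastore,
-- and drop metastore partitions whose files no longer exist
CALL system.sync_partition_metadata('myschema', 'my_partitioned_table', 'FULL');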

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

How do I run a CTAS (Create Table As) with a Query?

There are a few different ways to run a CTAS with a Query in Presto. Below we’ll lay those out.

OPTION 1: Managed Schema
Managed Schema Creation:
CREATE SCHEMA managed_schema WITH (location = 's3a://mybucket/managed_schema/');

Create Table AS with Managed Schema:

CREATE TABLE managed_movies_orc
COMMENT 'Managed Movies ORC'
WITH (format = 'ORC')
AS
select * from "ahana-demo".ahana_s3_movies

OPTION 2: External Table 
Technically you can’t do a CREATE TABLE AS with an external table; you will get this error:

presto error: External tables cannot be created using CREATE TABLE AS

You can, however, do a CREATE TABLE followed by an INSERT INTO once hive.non-managed-table-writes-enabled is set to true:
https://prestodb.io/docs/current/connector/hive.html

CREATE TABLE <your schema>.ext_table_movies_orc ( movieid bigint, title varchar, genres varchar )
WITH (
  format = 'ORC',
  external_location = 's3a://mybucket/external_table/'
);

INSERT INTO <your schema>.ext_table_movies_orc SELECT * FROM "ahana-demo".ahana_s3_movies;

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

What is the difference between a managed table and external tables?

The main difference between a managed and external table is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped.

In Presto:
External table: a table created with the ‘external_location’ property in its WITH clause.
Managed table: a table created in a schema whose WITH clause specifies a ‘location’.

You cannot “insert into” an external table by default; the setting hive.non-managed-table-writes-enabled=false prevents you from doing so. 

The expectation is that the data in the external table is managed externally, e.g. by Spark, Hadoop, Python scripts, or another external ETL process.

Below are the major differences between Internal vs External tables in Apache Hive.

  • Creation: By default, Hive creates an internal (managed) table. For an external table, use the EXTERNAL option/clause.
  • Ownership: For a managed table, Hive owns the metadata and table data and manages the lifecycle of the table. For an external table, Hive manages the table metadata but not the underlying files.
  • Drop behavior: Dropping an internal table drops the metadata from the Hive Metastore and the files from HDFS. Dropping an external table drops just the metadata from the metastore without touching the actual files on HDFS/S3.
  • Metadata updates: For a managed table, metadata for inserts, creation of new partitions, etc. is updated automatically when writes go through the metastore. For an external table, you need to explicitly run sync_partition_metadata to sync changes on S3 with the metastore.

In short, use managed tables when the metastore should manage the lifecycle of the table, or when generating temporary tables. Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
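As an illustration of the drop behavior described above, using the tables from the CTAS examples in the previous answer:

-- external table: only the metastore entry is removed;
-- the ORC files under s3a://mybucket/external_table/ stay intact
DROP TABLE ext_table_movies_orc;

-- managed table: both the metadata and the underlying data files are removed
DROP TABLE managed_movies_orc;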

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Utilizing SiSense and Presto For Distributed Data Analysis

Data Analysis Using SiSense

SiSense is a full-fledged data analysis platform that provides actionable data analysis to its users for data-driven management. As an end-to-end platform, it makes it seamless and simple to leverage disparate sources of data, carry out analysis on that data, and present the results to users in an easy-to-use manner. Some of the data sources it supports are data files, MySQL, Oracle, Salesforce, and BigQuery. By integrating different data stores, the platform can also overcome the problem of data silos and provide a holistic view of available data. Some of the organizations using SiSense include Nasdaq, Gitlab, Phillips, and Tinder.

Understanding How SiSense Works

SiSense is a web-based data analytics platform whereby servers are used for data ingestion, processing, and user interaction. The SiSense Server can be installed on a single machine or deployed in a cluster. Connecting to the data sources is handled by the ElastiCube Data Hub in two modes, self-managed Live Models or proprietary, super-fast ElastiCube Models. The ElastiCube Server and Application server handle data management and processing. The Web Server provides business users a way to interact with SiSense via a web app, mobile app, and REST API. Client apps handle various essential tasks like data source management, server management, distributed messaging, and node orchestration.

image from sisense.com

SiSense provides users with different ways of utilizing it for BI, including manual installation and/or cloud deployment. Data engineers handle data connections and management to create data models ready for analysis. Data developers use those models to carry out ad-hoc analysis, develop custom BI solutions, and create UI artifacts to be presented to business users.

What is Presto?

Presto was created by Facebook to handle the huge amounts of data it generates every minute. Like Hadoop, it could carry out distributed and parallel processing over numerous nodes. However, rather than writing intermediate results to disk, it holds them in memory. This allows for data processing in a matter of seconds instead of hours or days more common in Hadoop jobs. 

Distributed Data Analysis Using SiSense and Presto

The recommended way of using SiSense for big data and near real-time data analysis is a multi-node deployment. This allows for optimal performance and efficiency, especially when using ElastiCube models. Combining SiSense with Presto allows highly scalable BI solutions to be built with better performance and less effort.

Users benefit by having two distributed systems that are each highly optimized for the tasks they need to handle. Using Presto for data access provides connectivity to more data stores and allows dedicated, easily separable clusters to handle query processing. Presto is highly optimized for query execution thanks to its parallel and distributed architecture, and it simplifies data management by offering a single source of truth: a virtual data warehouse that SiSense can use as its data source. SiSense then handles BI and analysis.

Ahana Cloud is the cloud-native SaaS managed service for Presto, see how you can turbocharge SiSense in 30 minutes!

Get Started with Presto & SiSense

Collaborative Data Science Using Mode SQL And Presto

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Mode SQL.

Mode SQL In A Minute

Mode is a modern collaborative data science platform meant to help data scientists with diverse skillsets work effectively both individually and as a team. By supporting the use of SQL, Python, and R, Mode SQL enables users to use the language they are most comfortable working with or combine them as needed. Support for multiple users helps teams collaborate on projects and increases productivity. Created by Mode Analytics, it is an online platform meant to offer simplicity, power, and versatility to data scientists and analysts. This has resulted in more than 50% of the Fortune 500 companies using the platform. Some firms using Mode include VMWare, Bloomberg, Everlane, InVision, and Reddit.

Mode SQL

Among the products the platform offers is the Mode SQL Editor, which uses SQL to carry out data analysis online and to share the results with business users. To be able to use the platform, one either registers for the free plan that supports 5 users or the business or enterprise plans that offer more features, handle larger volumes of data, and support more users. Users provide connections to the data store they wish to use from the various databases supported. 

image from mode.com

Data scientists and analysts then create definitions and run queries against the connected database. Advanced logic like looping and logic statements can be added through Liquid. The results are used to create reports and dashboards using Mode chart builder or piped to notebooks for further analysis. Workspaces, member management, and report management can be handled programmatically through APIs. Combined with support for multiple databases, these features greatly simplify the data science workflow. Some of the data stores Mode SQL support include MariaDB, PostgreSQL, TimescaleDB, Snowflake, Vertica, Redshift, and Presto.

Why Use Presto

Using Presto increases the performance of the developed solutions due to its massively parallel processing (MPP) architecture. It enables users to process big data efficiently and improves response times. The use of SQL greatly reduces the entry barrier for beginners. This enables users to carry out analysis and gain valuable insights from the data faster, improving productivity and performance.

Data Science With Mode SQL and Presto

The Mode platform can be combined with the Presto query engine to provide users with a high-performance stack for developing custom data science solutions. Presto is able to federate many databases, providing both a virtualized data lake and access to more data sources than Mode connects to. Therefore, users have access to all their data in a centralized place which they can connect to using Mode SQL as Presto supports SQL. Results of the query can be processed with more powerful languages offering features absent in SQL on the platform using Liquid, Python, and/or R.

To use Mode SQL with Presto, one needs to have a Presto server/cluster running. One then connects Mode to the Presto instance and writes SQL queries against it. The results of the queries are saved and used to build dashboards and reports to be accessed by end-users.

Ahana Cloud is the cloud-native SaaS managed service for Presto, see how you can turbocharge Mode in 30 minutes!

Get Started with Presto & Mode SQL

What is Presto and what are its frequently asked questions (FAQ)?

What is Presto?

Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratchpad for a long calculus problem.

Why is Presto so fast?

In the past, distributed query engines like Hive were designed to persist intermediate results to disk. Presto saves time by executing the queries in the memory of the worker machines, including performing operations on intermediate datasets there, instead of persisting them to disk. The data can reside in HDFS or any database or any data lake, and Presto performs the executions in-memory across your workers, shuffling data between workers as needed. Avoiding the need for writing and reading from disk between stages ultimately speeds up the query execution time. In short: Hive persists intermediate data sets to disk, while Presto executes tasks in-memory.

If this distributed in-memory model sounds familiar, that’s because Apache Spark uses the same basic concept to effectively replace MapReduce-based technologies. However, Spark and Presto manage stages differently. In Spark, data needs to be fully processed before passing to the next stage. Presto uses a pipeline processing approach and doesn’t need to wait for an entire stage to finish.
Presto was developed with the following design considerations:
• High performance with in-memory execution
• High scalability from 1 to 1000s of workers
• Flexibility to support a wide range of SQL use cases
• Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
• Federation of data sources via Presto connectors
• Seamless integration with existing SQL systems by adhering to the ANSI SQL standard

Portions of this answer are excerpted from the O’Reilly book “Learning and Operating Presto,” get your free digital copy from Ahana: https://ahana.io/ebook/

What is Presto used for?

Presto is used as the distributed query engine in an analytics platform. The Presto engine is used for many different types of analytical workloads to address a variety of problems. Leveraging a separate storage and compute model, also known as “disaggregated,” Presto can work with one or more data stores, metastores, and SQL tools.

What are common Presto use cases?

1. Ad hoc querying
Presto was originally designed for interactive analytics, or ad hoc querying. That means results are returned in a matter of a few seconds so an analyst can drill-down and ask questions of large amounts of data interactively. In today’s “Internet era” competitive world and with data systems able to collect granular amounts of data in near-real-time, engineers, analysts, data scientists, and product managers all want the ability to quickly analyze their data and become data-driven organizations that make superior decisions, innovating quickly and improving their businesses.

They either simply type in simple queries by hand or use a range of visualization, dashboarding, and BI tools. Depending on the tools chosen, they can run 10s of complex concurrent queries against a Presto cluster. With Presto connectors and their in-place execution, platform teams can quickly provide access to the data sets that users want. Not only do analysts get access, but they can run queries in seconds and minutes–instead of hours–with the power of Presto, and they can iterate quickly on innovative hypotheses with the interactive exploration of any data set, residing anywhere.

2. Reporting and dashboarding
Because of the design and architecture of Presto and its ability to query across multiple sources, Presto is a great backend for reporting and dashboarding. Unlike the first-generation static reporting and dashboarding, today’s interactive reporting and dashboards are very different. Analysts, data scientists, product managers, marketers and other users not only want to look at KPIs, product statistics, telemetry data and other data, but they also want to drill down into specific areas of interest or areas where opportunity may lie. This requires the backend – the underlying system – to be able to process data fast wherever it may sit. To support this type of self-service analytics, platform teams are required to either consolidate data into one system via expensive pipelining approaches or test and support every reporting tool with every database, data lake and data system their end users want to access. Presto gives data scientists, analysts and other users the ability to query data across sources on their own so they’re not dependent on data platform engineers. It also greatly simplifies the task of the data platform engineers by absorbing the integration testing and allowing them to have a single abstraction and end point for a range of reporting and dashboarding tools.

3. ETL using SQL
Analysts can aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto. Instead of legacy batch processing systems, Presto can be used to run resource-efficient and high throughput queries. ETL can process all the data in the warehouse; it generates tables that are used for interactive analysis or feeding various downstream products and systems.
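As a hedged illustration of such an ETL step (the catalog, schema, and table names are hypothetical):

-- nightly aggregation written from a raw table into a reporting table
INSERT INTO hive.reporting.daily_order_summary
SELECT
	order_date,
	country,
	count(*) AS orders,
	sum(order_total) AS revenue
FROM hive.raw.orders
GROUP BY order_date, country;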

Presto as an engine is not an end-to-end ETL system, nor is Hive or Spark. Some additional tools can be easily added to coordinate and manage numerous ongoing time-based jobs, a.k.a. cron jobs, which take data from one system and move it into another store, usually with a columnar format. Users can use a workflow management system like open source Apache Airflow or Azkaban. These automate tasks that would normally have to be run manually by a data engineer. Airflow is an open source project that programmatically authors, schedules and monitors ETL workflows, and was built by Airbnb employees who were former Facebook employees. Azkaban, another open source project, is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs.

The queries in batch ETL jobs are much more expensive in terms of data volume and CPU than interactive jobs. As such the clusters tend to be much bigger. So some companies will separate Presto clusters: one for ETL and another one for ad hoc queries. This is operationally advantageous since it is the same Presto technology and requires the same skills. For the former, it’s much more important that the throughput of the entire system is good versus latency for an individual query.

4. Data Lake Analytics
Data lakes have grown in popularity along with the rise of Amazon S3-compatible object storage which Amazon AWS has made popular. A data lake enables you to store all your structured and unstructured data as-is and run different types of analytics on it.

A data warehouse is optimized to analyze relational data, typically from transactional systems and business applications, where the data is cleaned, enriched, and transformed. A data lake is different from a data warehouse in that it can store all your data–the relational data as well as non-relational data from a variety of sources, such as from mobile apps, social media, time-series data–and you can derive more new insights from the analysis of that broader data set. Again you can do so without necessarily needing to process and copy that data beforehand.

Presto is used to query data directly on a data lake without the need for additional ingestion or copies. You can query any type of data in your data lake, including both structured and unstructured data. As companies become more data-driven and need to make faster, more informed decisions, the need for analytics on an increasingly larger amount of data has become a higher priority in order to do business.

5. Real-time analytics with real-time databases
Real-time analytics is increasingly used in conjunction with consumer-facing websites and services. This usually involves combining data that is being captured in real time with historical or archived data. Imagine if an e-commerce site had a history of your activity archived in an object store like S3, but your current session activity is getting written to a real-time database like Apache Pinot. Your current session activity may not make it into S3 for hours until the next snapshot. By using Presto to unite data across both systems, that website could provide you with real-time incentives so you don’t abandon your cart, or it could determine if there’s possible fraud happening earlier and with greater accuracy.
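A sketch of what such a federated query might look like (the catalogs s3_datalake and pinot, and all table and column names, are hypothetical):

-- combine archived order history on S3 with live session activity in Pinot
SELECT h.user_id,
       h.lifetime_orders,
       s.items_in_cart
FROM s3_datalake.ecommerce.order_history AS h
JOIN pinot.default.live_sessions AS s
  ON h.user_id = s.user_id
WHERE s.minutes_since_last_click > 30;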

Portions of this article are excerpted from the O’Reilly book “Learning and Operating Presto,” get your free digital copy from Ahana: https://ahana.io/ebook/. This includes a chapter of Presto with Pinot, a real-time database.

What is Apache Presto?

Apache Presto is a misnomer: Presto is not a project hosted under the Apache Software Foundation (ASF). Incubator and top-level ASF projects are subject to the naming convention “Apache [Project Name]”; one example is Apache Cassandra. Instead, the open source Presto project is hosted with another well-known project hosting body, The Linux Foundation, which creates sub-foundations; Presto is specifically under the Presto Foundation. This is similar to how Kubernetes is a project in the Cloud Native Computing Foundation, which is a sub-foundation of The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which Presto is licensed under.

Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratchpad for a long calculus problem.

What is Trino and what are its frequently asked questions (FAQ)?

What is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine that was forked from the original Presto project, whose GitHub repo was called PrestoDB. As such, it was designed from the ground up for fast queries against any amount of data. It supports many types of data sources, including relational and non-relational sources, via its connector architecture.

What is the history of Trino?

Trino is a hard fork of the widely popular open source Presto project, which started out at Facebook, running large-scale interactive analytic queries against a 300PB data lake using Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive. In November 2013, Facebook open sourced Presto under the Apache 2.0 license and made it available in the public GitHub code repository named “prestodb”. In early 2019, the hard fork named Trino was started by the creators of Presto, who later became cofounders/CTOs of the commercial vendor Starburst. In the meantime, Presto became part of the openly governed Presto Foundation, hosted under the guidance and experience of The Linux Foundation. Trino has subsequently diverged from Presto. Many of the innovations the community is driving in Presto are not available in Trino. Today, only Presto, not Trino, is running at companies like Facebook, Uber, Twitter, and Alibaba.

Why is Trino so fast?

As Trino is a hard fork of the original Presto project, it carries with it some of the original elements which make Presto so fast, namely the in-memory execution architecture. Prior to Presto, distributed query engines such as Hive were designed to store intermediate results to disk.

How does Trino work?

Trino is a distributed system that runs on Hadoop and uses an architecture similar to massively parallel processing (MPP) databases. It has one coordinator node working with multiple worker nodes. Users submit SQL to the coordinator, which uses its query and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It supports standard ANSI SQL, including complex queries, joins, aggregations, and outer joins.

What is Apache Trino?

Apache Trino is a misnomer: Trino is not a project hosted under the well-known Apache Software Foundation (ASF). Apache Incubator and top-level projects are subject to the naming convention “Apache [Project Name]”; an example is Apache Mesos. Instead, the Trino project, which is a hard fork of Presto, is hosted by a vendor-controlled non-profit called the Trino Software Foundation. It is not affiliated with any well-known project hosting organization like the ASF or The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which Trino is licensed under.

Is Trino OLAP?

Trino is an open source distributed SQL query engine. It is a hard fork of the original Presto project created by Facebook. It lets developers run interactive analytics against large volumes of data. With Trino, organizations can easily use their existing SQL skills to query data without having to learn new complex languages. The Trino architecture is quite similar to traditional online analytical processing (OLAP) systems using distributed computing architectures, in which one controller node coordinates multiple worker nodes.

What is the Trino Software Foundation?

The Trino Software Foundation is a non-profit corporation controlled by the cofounders of the commercial vendor Starburst. It hosts the open source Trino project, a hard fork of the Presto project, which is separate and hosted by The Linux Foundation. The Trino website offers only two sentences about the foundation: “The Trino Software Foundation (formerly Presto Software Foundation) is an independent, non-profit organization with the mission of supporting a community of passionate users and developers devoted to the advancement of the Trino distributed SQL query engine for big data. It is dedicated to preserving the vision of high quality, performant, and dependable software.” What is not mentioned is any form of charter or governance. These are table stakes for Linux Foundation projects, where project governance is central to the project.

What SQL does Trino use?

Trino, like the original Presto, is built with a familiar SQL query interface that allows interactive SQL on many data sources. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations.
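
For illustration, here is the kind of ANSI SQL statement, with a join and an aggregation, that runs the same way on Trino as on Presto; the catalog, schema, and table names below are hypothetical:

SELECT n.region,
       count(*)           AS order_count,
       sum(o.total_price) AS revenue
FROM hive.sales.orders o
JOIN hive.sales.nation n
  ON o.nation_key = n.nation_key
WHERE o.order_date >= DATE '2021-01-01'
GROUP BY n.region
ORDER BY revenue DESC;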

What Is A Trino database?

Trino runs as a distributed system over Hadoop/HDFS and other data sources, using a classic MPP (massively parallel processing) model. The Java-based system has a coordinator node (master) working in conjunction with a scalable set of worker nodes. Users send their SQL query through a client to the Trino coordinator, which plans and schedules a distributed query plan across all its worker nodes. Trino and Presto are SQL query engines and thus are not databases by themselves. They do not store any data, but from a user perspective, Trino can appear as a database because it queries the connected data stores.

What is the difference between Presto and Trino?

There are technical innovations and differences between Presto and Trino that include:
– Presto is developed, tested, and runs at scale at Facebook, Uber, and Twitter
– Presto uses 6X less memory and repartitions 2X faster with project Aria
– “Presto on Spark” today can run massive batch ETL jobs.
– Presto today is 10X faster with project RaptorX, providing caching at multiple levels
– The Presto community is making Presto more reliable and scalable with multiple coordinators instead of the single point of failure of one coordinator node.  

Trino can query data where it is stored, without needing to move data into a separate warehouse or analytics database. Queries are executed in parallel in the memory of distributed worker machines, with most results returning in seconds. Whereas Trino is a newer fork, Presto continues to be used by many well-known companies: Facebook, Uber, Twitter, AWS. Trino is a vendor-driven project, as it is hosted in a non-profit organization owned by the cofounders of the Trino vendor Starburst. In comparison, Presto is hosted by the Presto Foundation, a sub-foundation under The Linux Foundation. Multiple vendors support Presto, including the Presto-as-a-Service (SaaS) offerings Ahana Cloud for Presto and AWS Athena, which is based on Presto, not Trino.

As the diagram below illustrates, Presto saves time by running queries in the memory of the worker machines and operating on intermediate data sets in memory, which is much faster than persisting them to disk. It shuffles data amongst the workers as needed, which also avoids writes to disk between stages. Hive persists its intermediate data sets to disk; Presto executes its tasks in memory.

While Presto and Trino share the same pipelining approach, Presto has a number of performance innovations that are not in Trino, such as caching. For more about the differences, see the April 2021 talk by Facebook at PrestoCon Day, which describes what they, along with others like Ahana, are doing to push the technology forward.

Trino is a distributed SQL query engine that is best used for running interactive analytic workloads on your data lakes and data sources. It serves use cases similar to those the original Presto project was designed for. It allows you to query many different data sources, whether it's HDFS, Postgres, MySQL, Elasticsearch, or an S3-based data lake. Trino is written in Java and can also integrate with other third-party data sources or infrastructure components.

After the query is parsed, Trino processes the workload into multiple stages across workers. Computing is done in-memory with staged pipelines.

To make Trino extensible to any data source, it was designed with a storage abstraction that makes it easy to build pluggable connectors. Because of this, Trino has a lot of connectors, including non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational sources such as MySQL, PostgreSQL, Amazon Redshift, and Microsoft SQL Server. Like the original community-driven open source Presto project, the data is queried where it is stored, without the need to move it into a separate analytics system.

Turbocharge your BI and Analytics Using Superset and Presto

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Superset.

Interactive, ad hoc queries and lightning fast data visualizations for Superset

What is Apache Superset? Apache Superset is a big data exploration and visualization platform that was created at an Airbnb hackathon in 2015. Since then, it has grown into an enterprise-level, open-source business intelligence (BI) system offering features found in commercial solutions like Looker and Tableau. It is a highly customizable BI system, able to visualize and explore huge amounts of data, and is meant for users of all skill levels. Some of the companies using Superset include Ubisoft, Dropbox, Yahoo, and American Express.

Using Superset, one can create visualizations and explore data either through a GUI or with SQL, which makes it accessible to both casual and technical users. It allows for near real-time analysis of huge amounts of data and supports geospatial data analytics. Superset can connect to many different SQL-based databases to enable easy creation of charts, interactive dashboards, geospatial visualizations, and data exploration using SQL or drag-and-drop, among other features. Some of the data sources supported include PostgreSQL, Oracle, Redshift, BigQuery, DB2, and Presto (not technically a database).

Anatomy of Apache Superset

Superset consists of a Python-based backend and a JavaScript frontend. The backend handles connecting to different databases and carrying out data processing, while the frontend handles data visualization and user interactions. The backend uses different connectors and SQLAlchemy to connect to different databases, Pandas for data analysis, and Flask for the web server. Optional visualization caching is supported using software like Memcached and Redis. A React/Redux architecture is used to create the frontend.

One way Superset is used is via direct connections to individual supported data stores. Another way is to connect Superset to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads, and instant, seamless access to multiple data sources. With Superset + Presto, a distributed SQL query engine, it becomes an even more powerful tool.

What is Presto?

Just like Superset, Presto was created for developing big data solutions. Presto can connect to more than just SQL-based data sources and is built for distributed and parallel data querying. It then exposes these data sources, even S3-based data lakes, as if they were SQL-compliant databases. Using Superset with Presto enables the creation of a highly decoupled and scalable BI solution.

Presto is an open source SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. 

Faster Queries and Unified Access to more Data with Open Source Superset and Presto


Presto is one of the data sources that Superset supports. By combining them, developers and organizations can leverage the features of two open source, distributed, and highly scalable systems to meet their BI needs. A Superset cluster can be used to carry out data visualization and exploration while a Presto cluster can be used to connect to disparate data sources. This allows BI to be carried out using non-SQL data sources like HDFS and NoSQL databases.

Connecting them is usually done using the pyhive library, which provides SQLAlchemy dialects for both Presto and Hive; it is installed with the command pip install pyhive. One then provides Superset with the URI for the Presto data source, typically in the form presto://user@host:port/catalog/schema (a hive:// URI is used when connecting through a Hive server instead). Custom BI solutions can be created using SQL that is passed to Presto, which handles the actual query processing. Query results are passed back to Superset for data analysis and visualization.

If you want to get started with Presto, you can check out Ahana Cloud. It’s SaaS for Presto and offers a fully managed service for Presto in the cloud. Ahana Cloud comes pre-integrated with Apache Superset.

With a few clicks you can add Superset to your Presto cluster.

Turbocharge Business Intelligence Using Redash and Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Redash.

Interactive, ad hoc queries and faster data visualizations for Redash

What is Redash? Redash is an open source business intelligence (BI) system that uses SQL for data analysis and visualization. It enables the collaborative creation of visualizations and dashboards from different data sources using SQL. The data sources are not restricted to SQL conformant databases but include NoSQL databases like MongoDB and object storage systems like AWS S3. The visualizations, dashboards, reusable snippets, and other Redash artifacts are then shared either internally or publicly with other users to provide insight and support data-driven decision-making. This allows organizations and individuals to strategize and make decisions based on the latest data available to them. Some enterprise companies using Redash include Mozilla, SoundCloud, Atlassian, and Cloudflare.

Advantages of Using Redash

There are many advantages of using Redash for data analytics and visualization. Being open source, it can be installed for free and modified as per each specific use case with no restrictions. It can be installed and hosted free manually or accessed via premium cloud offerings, preventing vendor lock-in. The use of SQL for carrying out data processing makes it accessible to many users who already know SQL. Redash’s web-based interface and collaborative nature make it easy to create, share and use BI artifacts across diverse teams and departments.

image from redash.io

Redash Internals

Redash consists of a JavaScript frontend and a Python backend. The frontend is based on React, while older versions are based on AngularJS. The frontend handles user interaction and has the SQL editor, dashboards, and visualizations. The backend consists of a Flask web server that provides a REST API and a PostgreSQL database that handles data caching. Queries are handled using Celery workers responsible for the actual data processing and for connecting to the various data sources. Redash is packaged as a single-page app (SPA), a web-based system with high scalability and availability.

One way Redash is used is via direct connections to individual supported data stores. The other way is to connect Redash to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads and instant, seamless access to multiple data sources. With Redash + Presto, a distributed SQL query engine, it can become an even more powerful tool.

What is Presto?

Presto is a massively parallel processing (MPP) query engine meant to interact with different data sources and process data extremely fast. This is achieved by storing data and intermediate results in-memory rather than writing them to disk as Hadoop does. Using Redash and Presto enables the separation of the BI and querying systems. Different people thus can use, optimize, and manage them independently.

How Redash works with Presto


Combining Redash and Presto allows the development of free and distributed BI systems as both of them are open source. Typically, this is achieved by having a frontend Redash cluster that communicates with a backend Presto cluster. The backend handles query processing and data management while the frontend provides the user interface. 

Ahana Cloud customer Cartona uses Redash with Presto and Ahana Cloud to power its dashboards. Learn more about their use case in their presentation.

PrestoCon Day, 2020 Talk by eCommerce company Cartona

Having Presto as the query engine provides better performance and access to more data sources via the Presto connectors.

The architecture consists of Redash connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries. Redash handles the visualization of reports and dashboards. One configures Redash to connect to the Presto cluster by selecting Presto as the data source type and providing the other properties like host, port, and catalog as appropriate. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Redash users. Users are then able to run queries against the data sources easily, as both Presto and Redash use a SQL interface. Integrating them offers other advantages such as data federation, fast query processing, and being able to have different clusters that can be optimized to best meet the needs of business analysts and data scientists.

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Redash in 30 minutes!

Get Started with Presto & Redash

How does the Price-Performance ratio of AWS Athena Presto compare to Ahana Cloud for Presto?

First, Overall Differences

Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project which was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. There are a handful of popular services that use Presto, including both AWS Athena and Ahana Cloud. 

The biggest difference between the two is that AWS Athena is a serverless offering while Ahana Cloud is a managed service for Presto servers. The next biggest difference is the pricing model: with AWS Athena, instead of paying for the amount of compute used, you pay for the amount of data scanned. Ahana Cloud, on the other hand, is priced by the amount of compute used. This can make a huge difference in price/performance. Before we get into price-performance specifically, here's an overview of the comparison:

AWS Athena (serverless Presto) vs. Ahana Cloud for Presto (managed service)

Cost dimension
  AWS Athena: Pay for the amount of data scanned on a per-query basis, at USD $5 per terabyte scanned. It may be hard to estimate how much data your queries will scan.
  Ahana Cloud: Pay only for EC2 usage on a per-node/hour basis for EC2 and Ahana.

Cost effectiveness
  AWS Athena: Only pay while the query is scanning, not for idle time.
  Ahana Cloud: Only pay for EC2 and Ahana Cloud while compute resources are running, plus ~$4 per day for the managed service.

Scale
  AWS Athena: Can scale query workloads but has concurrency limits.
  Ahana Cloud: Can easily scale query workloads without concurrency limits.

Operational overhead
  AWS Athena: Lowest operational overhead; no need to patch the OS, AWS handles that.
  Ahana Cloud: Low operational overhead; no need to patch the OS, Ahana Cloud handles that and the operation of the servers.

Update frequency
  AWS Athena: Infrequent updates to the platform; not current with PrestoDB, over 60 releases behind.
  Ahana Cloud: Frequent updates to the platform; typically, Presto on Ahana Cloud is upgraded on a quarterly basis to keep up with the most recent releases.

Both let you focus on deriving insight from your analytical queries, as you can leave the heavy lifting of managing the infrastructure to AWS and the Ahana Cloud managed service. 

How do you define price-performance ratio?

Price–performance ratio
From Wikipedia, the free encyclopedia
In engineering, the price–performance ratio refers to a product’s ability to deliver performance, of any sort, for its price. Generally, products with a lower price/performance ratio are more desirable, excluding other factors.

Comparing the Price-Performance Ratio of Amazon Athena vs. Ahana Cloud

For this comparison, we’ll look at performance in terms of the amount of wall-clock time it takes for a set of concurrent queries to finish. The price is the total cost of running those queries. 

Instead of using a synthetic benchmark, we'll look at the public case study on real-world workloads from Carbon, who used Athena and then switched to Ahana Cloud. While your workloads will be different, you'll see why the price-performance ratio is likely many times better with Ahana Cloud. And by going through an example, you'll be able to apply the same method when doing a quick trial (we're here to help too).

Here’s a few things that the Carbon public case study showed:

  • While you cannot tell how many or what type of EC2 instances Athena V2 uses, they determined that they could get similar performance with 10 c5.xlarge workers on Ahana Cloud
  • Athena V2 would start to queue queries after there were 2 other queries running, meaning that the amount of wall-clock time was extended as a result.

AWS Athena is constrained by AWS concurrency limits

Ahana has higher concurrency so queries finish faster

  • The queries would be charged at a rate of $5/TB scanned regardless of the amount of compute used. Their 7 tests ended up scanning X TBs = $Y 
  • Ahana Cloud with 10 X c5.xlarge workers has total costs of:
Type                          Instance     Price/hr    Qty.    Cost/hr
Presto Worker                 c5.xlarge    17 cents    10      $1.70
Presto Coordinator            c5.xlarge    17 cents    1       $0.17
Ahana Cloud                                10 cents    11      $1.10
Ahana Cloud Managed Service                8 cents     1       $0.08
Total                                                          $3.45

So, you can run many queries for one hour that scan any amount of data for only $3.45 compared to one query of Athena scanning one TB of data costing $5.00.

Summary

While there is value in the simplicity of AWS Athena’s serverless approach, there are trade-offs around price-performance. Ahana Cloud can help.

Ahana is an easy cloud-native managed service with pay-as-you-go pricing for your PrestoDB deployment.

Get started with a 14-day free trial today.

Turbocharge Zeppelin with the lightning-fast Presto SQL query engine

What is Zeppelin?

Zeppelin is an open-source tool for producing data science notebooks that are then stored as .json files. 

This kind of analytics software is called a notebook. A Zeppelin note is made up of paragraphs, each with two sections: one where you type in your code, free text, or markdown-formatted annotations, and one where the results appear as output once you run the code.

The outputs can be charts, query results, or a markdown preview. In Zeppelin, you can organize these paragraphs, code and result sections together, into dashboards by stacking them or placing them next to each other.

With Zeppelin, you not only create data pipelines, but you can also visualize them as graphs to explain your ETL setup to others. If you want to make Zeppelin usable for your organization (like the marketing or sales teams, for example), you can build dynamic forms such as query templates. The users would only fill in the form and do not need to worry about the syntax and the correct structure of their queries.

Zeppelin works with a wide range of programming languages that are used in data analytics and data engineering. Among them are SQL, Python, Scala, shell, and markdown. This makes Zeppelin a popular tool for data scientists and engineers working with Apache data processing tools and frameworks. 

Zeppelin offers multi-user management for collaborative work on big projects. Companies generate large amounts of structured and unstructured data that sometimes has to be stored in different databases and data lakes. Using different interpreters, you can pull data from all of them into one notebook, with no need to create multiple notebooks.

For querying across Zeppelin and other data sources, many have turned to Presto, an open-source distributed query engine.

What Is Presto?

Presto is an open source distributed SQL query engine that you can connect to many different data sources like databases, cloud storage, NoSQL databases, and many more. Many use Presto for their data lake analytics needs, as well as for federated querying across a range of data sources.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by adding worker nodes. This has made it an efficient choice for organizations with different data formats and sources and/or a large amount of data to process.

How Presto Works With Zeppelin

With Presto, you can query structured and unstructured data at once inside one Zeppelin note.

To integrate Presto with Zeppelin and across other data sources, you will need to add Presto as a new interpreter in Zeppelin. This can be done without any code, as Zeppelin has an interface feature for it. You add a JDBC interpreter configured for Presto. Then, simply start writing a SQL query in a new paragraph, preceding it with the %jdbc(presto) directive to tell the note which interpreter to use.
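
As a minimal sketch, a Zeppelin paragraph bound to a JDBC interpreter configured for Presto might look like the following; the interpreter prefix and the catalog, schema, and table names are assumptions for illustration:

%jdbc(presto)
-- Runs on the Presto cluster and renders as a result table in the note
SELECT order_status, count(*) AS orders
FROM hive.default.orders
GROUP BY order_status
ORDER BY orders DESC;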

Since Presto uses SQL, you won’t have to invest time into learning another niche programming language. Thus, speed, scalability, and familiarity make Presto a popular tool for real-time analytics on Zeppelin.

Turbocharge Trifacta with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Trifacta.

Interactive, ad hoc queries and faster data exploration for Trifacta

What is Trifacta? Trifacta is a cloud-based data engineering front-end tool used by data analysts and data scientists to leverage an organization's data. It makes data collection and consolidation, data preparation, data transformation, and the creation of data pipelines easy to perform. These data pipelines can then be used for different use cases like business intelligence (BI), machine learning, and data analytics.

Trifacta is a cloud-native solution meant to make it easy to use data in different formats and structures for tasks like data analysis and machine learning. It achieves this by simplifying data cleaning, data exploration, data preparation, data validation, and the creation of data pipelines through a visual interface. It is supported on different cloud platforms including AWS, Google Cloud, and Microsoft Azure.

image from trifacta.com

Trifacta is used to create data engineering pipelines for data exploration in a cloud environment. One way Trifacta is used is via direct connections to individual supported data stores. The other way is to connect Trifacta to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads and instant, seamless access to multiple data sources.

What Is Presto?

Presto is an in-memory distributed query engine with a connector-based architecture for disparate data sources like S3 cloud storage, relational databases, and NoSQL databases. It was developed by Facebook to get lightning-fast query responses from its HDFS-based data lake / warehouse. Since then, it has been adopted by hundreds of other companies including Uber, Twitter, Amazon, and Alibaba.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by adding worker nodes. This has made it an efficient choice for organizations with different data formats and sources and/or a large amount of data to process.

Faster Queries and Unified Access to more Data using Trifacta and PrestoDB


As organizations become more data driven, both end-user computing and data access requirements are increasing. The solutions therefore must have high performance and be scalable to meet the demands placed on today’s data platform teams, who must respond quickly. This calls for the adoption of an open, flexible, distributed architecture. Combining Trifacta and Presto enables organizations to create a highly scalable, distributed, and modern data engineering platform.

A typical architecture consists of Trifacta connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries. Trifacta handles the visualization, reporting, and data wrangling tasks. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Trifacta users. Integrating them offers other benefits such as data federation, fast query processing, and being able to have different clusters that can be optimized to best meet the needs of data engineering workloads.

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Trifacta in only 30 minutes!

Getting Started is Easy and Free

Turbocharge Tableau with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Tableau.

Interactive, ad hoc queries and faster data visualizations for Tableau

What is Tableau? Tableau is a data visualization tool that can query a wide range of data sources like relational databases, cloud databases, spreadsheets, data warehouses, and much more.

Tableau allows people to see and understand their data and provides business intelligence. Users can combine data from various data sources for analysis in dashboards. You can also manage Tableau in your own environment or with a managed service. It offers an interface suitable for both visualization beginners and experts.

image from tableau.com

Tableau is capable of creating visualizations with sophisticated data analysis. Apart from building charts, Tableau helps to uncover deeply hidden relationships in the data. It also allows you to do calculations right inside the visualization layer, which makes it easier to adjust on the fly.

Consequently, Tableau is used regularly by academia and business professionals in different fields. Data scientists use Tableau to get insights on data spread across their business. Engineers use Tableau to quickly plot data and automate reporting for other business units. Business analysts create self-service reports for top managers.

One way Tableau is used is via direct connections to individual supported data stores. The other way is to connect Tableau to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads and instant, seamless access to multiple data sources. With Tableau + Presto, a distributed SQL query engine, it can become an even more powerful tool.

What is Presto?

Great analytics requires great amounts of data and a speedy query engine. This is where Presto comes into play. 

It is called a distributed query engine since it uses a cluster of scale-out worker instances. It's called a federated engine because it allows you to fetch data from multiple sources inside the same query. Furthermore, Presto works with both relational and non-relational databases. Even petabytes of data won't slow down its performance or prevent it from delivering results within seconds.

Presto is open source, licensed under Apache 2.0, and hosted by the Presto Foundation under The Linux Foundation. Anyone can download it from the GitHub repository and use it.

A few things make Presto stand out. More often than not, to query your raw data, you would need to move it to a new warehouse first. Presto works in-memory, querying the data directly where you store it. It creates an abstraction layer to transform data on the fly and dispatch it to you.

Presto runs on Hadoop(HDFS), S3-based cloud data lakes, and supports a wide variety of data sources. Presto’s architecture is close to being a massively parallel processing (MPP) database management system but is more advanced. 

Presto has multiple worker nodes that are orchestrated by a coordinator node. The worker nodes fetch your data across sources, and the coordinator plans the execution of your query and delivers the results.

Faster Queries and Unified Access to more Data using Tableau and PrestoDB


Traditionally, Business Intelligence tools, such as Tableau, are optimized for structured data queries against data warehouses. These BI tools were unable to support big data sources efficiently. It made for a slow (and sometimes expensive) process with a lot of manual workflows.

It’s also a manual process to connect Tableau to your various data sources, and you have to manage each one. If you connect Tableau to Presto, then you can run your queries with Presto which will query all of your data sources with one query, without the need to manage all of those connections. The performance is much better as well, especially when you have many users. Combining Tableau and Presto enables organizations to create a highly scalable, distributed, and modern data engineering platform.

You can connect to Presto from the Connectors menu directly or by using an SQL statement. Once connected, you will be able to access live Presto data within Tableau. You can also publish completed workbooks with Presto data.

A typical architecture consists of Tableau connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries. Tableau handles the visualization of reports and dashboards. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Tableau users. Integrating them offers other benefits such as data federation, fast query processing, and being able to have different clusters that can be optimized to best meet the needs of business analysts and data scientists.

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Tableau in 30 minutes!

Getting Started is Easy and Free

Presto and AWS S3

What is AWS S3?

Amazon Simple Storage Service (Amazon S3) is storage for the internet. Amazon S3 is used to store and retrieve any amount of data at any time, from anywhere on the web, by using the web interface of the AWS Management Console.

What is Presto?

PrestoDB is a federated SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. Many organizations are adopting Presto as a single engine to query against all available data sources. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. 

Why use Presto with AWS S3?

Analysts get better performance at a lower cost by using S3 with Presto, as users can scale their workloads quickly and automatically. Presto allows users to quickly query both unstructured and structured data. Presto is an ideal workload in the cloud because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes, without needing to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.

Presto executes queries over data sets that are provided by plugins known as Connectors. Integrating Presto with S3 provides users with several features:

  • Presto, running on Amazon EMR, allows developers and analysts to easily run interactive SQL queries that directly access data stored in Amazon S3 for data-exploration, ad-hoc analysis and reporting.
  • The Hive connector allows Presto to query data stored in S3-compatible engines and registered in a Hive Metastore (HMS). 
  • Data transfer between a Presto cluster and S3 is fully parallelized.
  • Presto can also be run in a serverless fashion on AWS (for example via Amazon Athena), with no servers, virtual machines, or clusters to set up, manage, or tune.

Since Presto is based on ANSI SQL, it’s very straightforward to start using it. The Presto connector architecture enables the federated access of almost any data source, whether a database, data lake or other data system. Presto can start from one node and scale to thousands. With Presto, users can use SQL to run ad hoc queries whenever you want, wherever your data resides. Presto allows users to query data where it’s stored so they don’t have to ETL data into a separate system. With an Amazon S3 connector, platform teams can simply point to their data on Amazon S3, define the schema, and start querying using the built-in query editor, or with their existing Business Intelligence (BI) tools. With Presto and S3, you can mine the treasures in your data quickly, and use your data to acquire new insights for your business and customers.

For users who are ready to use Presto to query their AWS S3 but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and has out-of-the-box integrations with AWS S3, in addition to AWS Glue and Hive Metastore (HMS).

You can try Ahana Cloud out free for 14 days, sign up and get started today.

Presto and Amazon Redshift

What is Amazon Redshift?

Amazon Redshift is a cloud data warehouse application service used by data analysts or data warehouse engineers for analyzing data using standard SQL and your existing Business Intelligence (BI) tools. Users can start with just a few hundred gigabytes of data and scale to a petabyte or more.

What is Presto?

Presto is a federated SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. Many organizations are adopting Presto as a single engine to query against all available data sources. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. 

Why Presto and Amazon Redshift?

Analysts get better performance at a lower cost by using Amazon Redshift with Presto, as users can scale their workloads quickly and automatically. Presto allows users to quickly query both unstructured and structured data. Presto is an ideal workload in the cloud because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes, without needing to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.

Presto executes queries over data sets that are provided by plugins known as Connectors. Integrating Presto with Redshift provides users with new capabilities:

  • Presto reads data directly from HDFS, so you don’t need to perform ETL on the data. Presto has also been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Redshift.
  • The Redshift connector allows users to query and create tables in an external Amazon Redshift cluster. Users can join data between different systems like Redshift and Hive, or between two different Redshift clusters. Since Presto on Amazon EMR supports spot instances, the total cost of running a data platform is lower.
  • Presto can reduce query execution time. Presto provides the ability to run queries in seconds instead of hours, and analysts can iterate quickly on innovative hypotheses with the interactive exploration of any dataset, residing anywhere.

Since Presto is based on ANSI SQL, it’s very straightforward to start using it. The Presto connector architecture enables the federated access of almost any data source, whether a database, data lake, or other data system. Presto can start from one node and scale to thousands. With Presto, users can use SQL to run ad hoc queries whenever you want, wherever your data resides. Presto allows users to query data where it’s stored so you don’t have to ETL data into a separate system. With a Redshift connector, platform teams can quickly provide access to datasets that analysts are interested in while being able to scale and meet the growing requirements for analytics. With Presto and Redshift, you can mine the treasures in your data quickly, and use your data to acquire new insights for your business and customers.
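
As a sketch of the kind of cross-system query this enables, the following joins a table in a Redshift catalog with a table in a Hive/S3 catalog; the catalog, schema, and table names are hypothetical:

-- Revenue by region, combining a warehouse table and a data lake table in one query
SELECT c.region,
       sum(o.total_price) AS revenue
FROM redshift.sales.customers c
JOIN hive.default.orders o
  ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY revenue DESC;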

Turbocharge Qlik with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Qlik.

Faster data visualizations & interactive, ad hoc queries for Qlik

What is Qlik? Qlik is a platform providing data visualization and BI for end-to-end analysis of data. It offers interactive and user-customizable data visualizations, data federations, data analytics, and business intelligence (BI). Qlik enables its users to use data from different sources to carry out interactive data analysis and visualization. Organizations can leverage the available data for day-to-day decision-making and daily operational processes, promoting a data-driven culture. Some of the companies using Qlik include Samsung, Toyota, HSBC, Merck, and Shell.

Qlik comes in two versions, a Windows application and a cloud offering, and it can also be deployed on multiple cloud environments. It provides associations between data that may reside in different data sources and carries out analytics in-memory for improved performance. This provides fast analytics and a unified view of an organization's data.

The Architecture of a Qlik App

A Qlik app consists of data source, back-end, and front-end components. The data source end is responsible for handling data access between the app and the various data sources used. The back-end component contains the Publisher, which is responsible for pulling data from the different sources and providing the view files to a Qlik Server. It is mainly used by software developers, system analysts, database administrators, and other technical personnel.

image from qlik.com

The front-end accesses the views provided by the server and presents them as an interactive web app. It leverages the associative model used by Qlik to make selections, the main way of carrying out data analysis. The main users are organizational employees in different departments including marketing, sales, finance, management, and logistics.

What Is Presto?

Presto is a SQL query engine originally developed by Facebook to replace Hive and enable fast, interactive queries against the social media platform's huge amounts of data. It is meant to connect to different data sources and to perform queries against them in a parallel and distributed manner. Presto is an open source project, housed by The Linux Foundation's Presto Foundation. The connector-based architecture allows disparate data sources like S3 cloud storage, relational, and NoSQL databases to be queried. Hundreds of other companies have since adopted Presto, including Uber, Twitter, Amazon, and Alibaba.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel approach. The computational power of a cluster can thus be increased by increasing the number of worker nodes. This has made it a lightning-fast choice for organizations with different data formats and sources and/or a large amount of data to process.

Faster Queries and Unified Access to more Data using Qlik and Presto


By using Presto, an organization can create clusters running Qlik applications targeted at different use cases. These use cases include functions such as data aggregation, data analysis, ad-hoc querying, and data visualization. It also helps an organization develop a unified view of the data. Qlik apps only have to interface with one data service, Presto, while being able to query multiple data sources. This is because Presto abstracts the different data sources being used by the apps.

The common approach is to deploy different Presto clusters for different use cases. The Qlik apps then connect to the clusters, relying on them to perform the actual data querying. Being an in-memory distributed query engine, PrestoDB can process large amounts of data in very short periods of time. The use of a cluster architecture means that horizontal scaling can be carried out easily and efficiently.

Combining Qlik and Presto enables organizations to create a highly scalable, distributed, and modern data engineering platform.

A typical architecture consists of Qlik connected to a Presto cluster which, in turn, is connected to one or more data sources. Presto handles the data access and in-memory processing of queries. Qlik handles the visualization, reporting, and dashboarding. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the workloads of the Qlik users. Integrating them offers other benefits such as data federation, fast query processing, and being able to have different clusters that can be optimized to best meet the needs of business analysts and data scientists.

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Qlik in 30 minutes!

Getting Started is Easy and Free

Presto and MySQL

What is MySQL?

MySQL is a popular open-source relational database management system (RDBMS) that data engineers use to organize data into tables. MySQL supports foreign keys to create relationships between tables. You can create, modify and extract data from the relational database using SQL commands.

MySQL is easy and intuitive to use; you only have to know a few SQL statements to use it, like SHOW DATABASES, CREATE DATABASE, USE, CREATE TABLE, SHOW TABLES, DESCRIBE, INSERT INTO … VALUES,  SELECT, WHERE, DELETE, ALTER TABLE…ADD, ALTER TABLE…DROP, and SHOW CREATE TABLE. 

MySQL has a client-server architecture, which means that many clients can communicate with the MySQL server to run queries, create tables, and so on. You can scale with ease; MySQL is fully multithreaded and handles large data volumes. It supports over 20 platforms, such as Linux, Windows, UNIX, and others. You can commit transactions or undo them with the ROLLBACK statement. It is memory efficient, and you can partition your data, which improves query performance.
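
As a brief illustration of a few of the statements mentioned above (the database, table, and column names are made up for the example):

CREATE DATABASE shop;
USE shop;

CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(100)
);

CREATE TABLE orders (
  order_id    INT PRIMARY KEY,
  customer_id INT,
  total       DECIMAL(10,2),
  FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO orders VALUES (100, 1, 25.50);

SELECT c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;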

MySQL Use Cases

Data engineers don’t typically use MySQL to run queries against massive data volumes. Instead, more common use cases might include: 

•      Small web-based databases: an eCommerce store, a web-page database, and an app database (with a small data volume)

•      Online analytical processing and online transaction processing systems (OLAP/OLTP) (with small data volume)

•      Building a business intelligence tool that doesn’t have a massive data volume

If you want to query big data with your MySQL database, you can use PrestoDB.

What is PrestoDB?

PrestoDB is an open source SQL query engine used to query data lakes and other data sources in place, without having to move your data. Presto runs queries directly against data stored in databases like MySQL as well as in data lakes, and it can join data from different sources like MySQL, Hadoop, and Cassandra in a single query.

Using MySQL and Presto together

Data engineers commonly use MySQL and PrestoDB for joining different systems like Hive, Cassandra, and MySQL and running queries across those data sources; it’s common to use partitioning to run queries efficiently.   

You can use MySQL and PrestoDB together to run queries against a massive data volume. PrestoDB queries the data where it lives in MySQL through its MySQL connector.

PrestoDB has a MySQL connector to run queries and create tables in your MySQL database. You also can join different MySQL databases or different systems like MySQL and Hive. 

PrestoDB is used to map the data in the MySQL storage engine to schemas and tables. Data engineers use the Hive metastore service to access the metadata catalog that lives in the MySQL database.

You create a catalog properties file named “catalog_name.properties” in “etc/catalog” to mount the MySQL connector. This file must contain the connector.name, connection-url, connection-user, and connection-password properties. You can have as many catalog files as you want.
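
As a sketch, a minimal MySQL catalog file, here named mysql.properties, might look like this; the host and credentials are placeholders:

connector.name=mysql
connection-url=jdbc:mysql://example-host:3306
connection-user=presto_user
connection-password=secret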

In PrestoDB, schemas contain database objects like tables, and there is one schema for each MySQL database. To create, modify, and extract data in your MySQL database, you can use the following statements (some run in MySQL itself, others through Presto against the MySQL catalog):

•      USE dbname: it selects the database

•      CREATE DATABASE dbname: it creates a database

•      CREATE TABLE table_name: it creates a table

•      LOAD DATA ... INTO TABLE table_name: it loads rows into a table from a file (a MySQL statement that runs directly in MySQL rather than through Presto)

•      SHOW SCHEMAS FROM catalog_name: it shows all databases from that catalog. 

•      SHOW TABLES FROM catalog_name.dbname: it shows all tables from the database (dbname). 

•      DESCRIBE catalog_name.dbname.table_name: it shows a list of columns in the table (table_name) in the database (dbname).

•      SHOW COLUMNS FROM  catalog_name.dbname.table_name: it shows a list of columns in the table (table_name) in the database (dbname).

•      SELECT * FROM catalog_name.dbname.table_name: it shows the table (table_name) in the database (dbname). 
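
Putting a few of these together, and assuming a MySQL catalog named mysql and a Hive catalog named hive (both hypothetical), queries from Presto might look like:

SHOW SCHEMAS FROM mysql;
SHOW TABLES FROM mysql.shop;
DESCRIBE mysql.shop.orders;

-- Federated query: join MySQL data with a table in the data lake
SELECT c.name,
       sum(o.total) AS lifetime_value
FROM mysql.shop.orders o
JOIN hive.default.customer_profiles c
  ON o.customer_id = c.customer_id
GROUP BY c.name;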

If you’re ready to use Presto to query MySQL but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and can integrate with MySQL.

Get started with a 14-day free trial of Ahana Cloud.

Presto and PostgreSQL

What is PostgreSQL?

PostgreSQL is an enterprise-grade open source database management system (DBMS) that is free and offers advanced functionality common to more expensive DBMSs. It is an object-relational database that supports multiple programming languages, custom data types, JSON, relational modeling, and table partitioning. Some of the companies using PostgreSQL in production include Instagram, Apple, Cisco, and Red Hat.

PostgreSQL's wide usage and support are the result of several factors. Being open source, the database comes for free and benefits from contributions from a wide developer community. Because it is so widely used, there are many talented and experienced people who are already conversant with it, so there is no shortage of skilled personnel who can implement and maintain data management systems powered by PostgreSQL. Features such as table partitioning, replication, and geospatial data support mean it offers capabilities normally found in closed source enterprise DBMSs that are expensive to acquire and run.

PostgreSQL is highly extensible and scalable and is used to manage massive datasets reliably. Support for foreign data wrappers ensures that it integrates easily to data streams and other databases. This makes it ideal for serving as the data storage system for online analytical processing (OLAP), advanced analytics, and BI solutions. Installation is simple since it is free to download. Additionally, free docker containers and paid cloud solutions exist to support a cloud-based microservices architecture for even better horizontal scalability of the developed systems.

What Is Presto?

Presto is a query engine used to query different data stores in differing formats in a distributed and parallel manner. It was originally developed by Facebook to manage its data warehouse developed using Apache Hive. 

Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Leveraging PostgreSQL and Presto

Combining PostgreSQL and PrestoDB offers many benefits to developers and organizations. The advanced database features like geospatial data support, full text search, and support for different programming languages combined with being open source make it highly customizable. Developers are therefore free to configure it as they deem best for their systems. Using Presto enables better solutions as it is also highly customizable, open source, and distributed. Worker nodes can integrate easily with multiple nodes running the database as Presto ships with a free PostgreSQL plugin.

PrestoDB works by using the plugin to connect to PostgreSQL. The plugin is responsible for generating SQL statements that are compatible with the database. Worker nodes use the plugin to communicate with the DBMS, issue commands, and fetch data. By pushing down as much work as possible to the database, Presto is able to leverage the power and features of PostgreSQL. This makes them a great combination for developing big data solutions. Some of the use cases for Presto and PostgreSQL are machine learning, data analysis, data visualization, and business intelligence.
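
As a minimal sketch, assuming a catalog named postgresql has been configured, a query like the following lets the connector push the filter down to PostgreSQL so that only matching rows reach the Presto workers; the schema and table names are hypothetical:

-- The WHERE predicate can be pushed down to PostgreSQL by the connector
SELECT ticket_id, status, created_at
FROM postgresql.public.support_tickets
WHERE status = 'open';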

If you’re ready to use Presto to query PostgreSQL but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and can integrate with PostgreSQL.

Get started with a 14-day free trial of Ahana Cloud.

Turbocharge Looker with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is the SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Looker.

Interactive, ad hoc queries and faster data visualizations for Looker

What is Looker? Looker is a business intelligence (BI) platform that helps individuals, businesses, and organizations get better insights from available data. Acquired by Google in 2019 and now part of GCP (Google Cloud Platform), it helps users explore data, find meaningful trends and patterns, and visualize data. It is available on GCP and AWS as a cloud solution, and it can be used to create custom solutions as needed. Looker can connect to different databases including BigQuery, MySQL, PostgreSQL, Oracle, PrestoDB, and Vertica, and supports more than 50 different SQL dialects.

Looker users include software developers, data scientists, marketers, and management for different use cases. These include business intelligence, data analytics, and supporting  data-driven decision-making from different data sources. Its power lies in being able to interact with different data sources to create a holistic view of the available data to different users and departments. This makes it easy to manage workflows in the whole project or organization in a data-driven manner.

Since the users do not have to learn SQL, everyone in a project or organization can customize their reports, dashboards, heatmaps, and other BI presentation assets easily. Furthermore, Looker goes beyond rudimentary BI to include support for fresh data, machine learning, performance monitoring, and ETL optimizations.

How Looker Works Under the Hood

Looker is a cloud-based platform that can be deployed to offer advanced BI. It works by using LookML to generate SQL code that is sent over a connection to a given database. This allows Looker to connect to different databases for business intelligence purposes.

A Looker project consists of one or more models that provide a connection to a database. A model can contain one or more Explores that provide an interactive webpage where a user can dynamically work with the data. 

image from looker.com

Explores contain views that map to the tables in the given database. Users are then able to issue queries to the database, filter their results, and visualize them in a variety of ways.

While Looker can be used via direct connections to individual supported data stores, another way is to connect Looker to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads, and instant, seamless access to multiple data sources. With Looker + Presto, Looker can become an even more powerful tool.

What is Presto?

Presto is a distributed query engine that allows in-memory processing of data from disparate data sources. It was developed by Facebook and is in use at other large companies including Uber, Twitter, Alibaba, and Amazon as a de facto standard for their SQL workloads.

Faster Result Sets and Unified Access to more Data using Looker and PrestoDB


Looker is an extremely powerful BI platform for a front-end facing app. Combining it with a Presto-based back end cuts query times down to seconds rather than minutes. To create a Looker and Presto system, you first deploy your Presto cluster. You then create a connection in Looker, selecting PrestoDB as the dialect.

Looker then relies on Presto to carry out query processing. The queries are run in a distributed manner and provide access to multiple data sources. This also provides a single source of truth with regard to the overall data model. This kind of configuration is used to provide highly scalable enterprise-level BI solutions by using Looker for BI and Presto for distributed data querying.

A typical architecture consists of Looker connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries. Looker handles the visualization of reports and dashboards. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Looker users. Integrating them offers other benefits such as data federation, fast query processing, and being able to have different clusters that can be optimized to best meet the needs of business analysts and data scientists.

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Looker in 30 minutes!

Ready to Turbocharge your Looker with Presto?

Turbocharge Jupyter with the lightning-fast Presto SQL query engine

What is Jupyter?

Jupyter Notebook is open-source software for creating shareable documents and reports. Each such document can easily host a whole working data pipeline. The .ipynb files can contain code chunks, free and markdown-formatted text, and even visual objects for data analysis.

Jupyter became particularly popular among data scientists and engineers using Python because it works with most Python packages. A Jupyter Notebook can quickly digest and process significant amounts of data, and you just need a couple of command lines to install it using a command shell. 

While Jupyter has no drag-and-drop elements, it still offers a few simple no-code features. The software is called “notebook” due to its interface. In a cell, you can type in a chunk of code, run it, and see the results printed to a small console underneath. Then you add another chunk and repeat the operation, getting the results appended under the second chunk. Since you have your code and all outputs in one place, ordered chronologically, it is very similar to making notes with a paper block.

image from jupyter.org

Jupyter works well for data analysis or data pipeline drafting and testing. If you need some data insights and a report based on them, you can add your comments and annotations directly into the Notebook, run the Kernel, and export the whole document as an HTML file to be sent over. You can also build a data pipeline, load and process your data, and export it to a database or an Excel file, or use one of the many other output methods available. Once you are satisfied that it runs smoothly and delivers adequate results, you can deploy the pipeline in your main system.

Jupyter has a fairly steep learning curve. Fortunately, its contributors have put together thorough documentation to help new users and enable troubleshooting.

What is Presto?

Presto is an open source SQL query engine used to query data lakes and other data stores in differing formats in a distributed and parallel manner. It was originally developed at Facebook to run interactive queries against its massive data warehouse built on Apache Hive.

The Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Faster Queries and Unified Access to more Data using Jupyter and Presto

You can use Presto with Jupyter Notebook to expand your data pools.

To connect to your Presto engine, you can use the Presto connector from the pyhive package. All Python-based connectors use similar logic. In the case of Presto, you need to enter your host and port, and then add your catalog and schema. After the connection has been established, you can run SQL queries to fetch the data.

With Presto, you can query both relational and non-relational databases and other data sources. Moreover, you can combine data from more than one data source inside a single query. Presto is also performant – calculations on terabytes of data take minutes. Presto sends one query to multiple worker nodes via a coordinator node, distributing the workload. The coordinator collects the results once they are ready and returns them to you.

Ahana Welcomes Satish Ramakrishnan as VP, Engineering to Senior Management Team

Ramakrishnan brings deep SaaS and data management expertise to Presto managed service to lead engineering and technical operations

San Mateo, Calif. – April 7, 2021 Ahana, the self-service analytics company for Presto, announced today the appointment of Satish Ramakrishnan as VP, Engineering, reporting to Steven Mih, Cofounder and CEO. He will head up engineering and technical operations with a focus on advancing Ahana’s vision of simplifying ad hoc analytics for organizations of all shapes and sizes.  

“Satish’s appointment is an important and high-impact addition to the senior team as we expand the capabilities of the industry’s first Presto managed service,” said Mih. “As vice president of engineering, Satish will be integral to helping build out Ahana Cloud for Presto to bring the power of the most powerful open source distributed SQL query engine to any organization. He brings extensive management, technical and organizational leadership to Ahana and has an excellent track record of building high-performing teams and data management services that deliver incredible value.”

“I look forward to helping deliver on Ahana’s vision of bringing the power of Presto to data platform teams of all sizes,” said Ramakrishnan. “From my first meetings with the Ahana team, I knew this was an incredible opportunity to be involved in one of the most popular analytics engines today. I am excited to continue building the managed service platform that delivers on and exceeds customers’ expectations.” 

Satish brings over 22 years of executive experience to Ahana with deep expertise in developing data-focused SaaS services on public cloud providers, including AWS and Google Cloud Platform. Prior to Ahana, Satish was the EVP Engineering at Reltio running engineering and technical operations, where he managed the continued innovation and stabilization of their market leading Master Data Management platform while ensuring revenue growth and operating margin improvement. He has held VP positions at startups like RiseSmart, PointCast, CoreObjects, etc. as well as at large enterprises like BEA and Comcast. At Comcast, he founded the Silicon Valley Innovation Lab where he took an idea from a piece of paper all the way through launch, which became part of the Comcast “Triple Play” offering. Satish holds a B.E. in Computer Science from Mysore University.

Supporting Resources:

Download a head shot of Satish Ramakrishnan. 

Tweet this: @AhanaIO welcomes Satish Ramakrishnan as #VP #Engineering #Presto #PrestoDB #Cloud #Data #OpenSource #Analytics https://bit.ly/3fAXFdj

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What are the AWS Glue partition limits and does it apply to AWS Athena?

Typically you’ll use AWS Glue to create the data sources and tables that Athena will query. You could also create your tables using Athena and then use them in AWS Glue. Today the limit for AWS Glue partitions is 10 million per table, while Athena’s internal catalog limit is 20,000 partitions per table. So if you use the AWS Glue catalog with your Athena deployment, you get a much higher partition limit.

When it comes to using Amazon Athena, there are a lot of other limits besides partitions including query, concurrent query, and database limits. Many times AWS Athena users find these limits to hinder SLAs and other business-critical applications. As a result, they’re looking for alternatives to AWS Athena and have found that Ahana Cloud’s Presto managed service can address those issues.

Ahana Cloud is a Presto managed service that gives users the power of Presto, similar to AWS Athena, but without the limitations around queries, performance, and control. With Ahana you get the full control of your Presto deployment on AWS, so you can scale as you need without having to worry about partition or query limits.

Learn more about how Ahana Cloud compares to AWS Athena.

Ready to get started? Sign up to try out Ahana Cloud for free – if you’re an AWS Athena user, it’s really easy to move from Athena to Ahana Cloud and requires no change to your existing architecture.

You can also check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Picking the right approach for Presto on AWS:
Comparing Serverless vs. Managed Service

Video
Slides

Ahana Announces New Capabilities for Its Presto Managed Service Including Data Lake Caching, Security and Seamless Operations

Caching advancements improve query performance up to 5x for complex workloads making Ahana the most advanced cloud service for Presto

San Mateo, Calif. – March 24, 2021 Ahana, the self-service analytics company for Presto, announced today at PrestoCon Day significant updates to its Ahana Cloud for Presto managed service. The major areas of focus include performance, better cluster management, ease of use, and security. One of the key features is the data lake IO caching capability that can dramatically improve query performance, reducing latencies up to 80%. 

“Our latest innovations make Ahana Cloud the fastest and most advanced cloud native managed service available for Presto,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “We have seen that advanced capabilities like data lake IO caching can improve query performance up to 5x for real-world concurrent query workloads. This, along with our new ease of use, security and cost management advancements, continues to extend Ahana’s leadership in the market, giving data platform teams more value using the power of Presto for faster insights, easier access, and lower costs.”

“We knew we wanted to use Presto to query all of our AWS data sources, but we found it challenging to manage it on our own in the cloud,” said Omar Mohamed, Data Engineer, Cartona. “Ahana Cloud made it possible for our team to deliver real-time analytics with Presto without having to worry about the complexities. We look forward to doing even more with Ahana’s latest set of capabilities.”

Ahana Cloud for Presto features new capabilities, including:

Performance improvements

  • Data lake IO Caching – Ahana Cloud for Presto is the only cloud-native managed service for Presto and, with the new data lake IO caching capability, is now also the only Presto-plus-caching service for users. Data lake IO caching, built on the RubiX open source project, eliminates the need to repeatedly read the same data from data lakes like AWS S3 when it is queried over and over. Today, Ahana Cloud users can easily enable data lake caching with the click of a button when creating a new Presto cluster. The rest, including attaching SSDs and sizing the cache based on the user-selected instance type, is all done automatically. Ahana has also open sourced this feature for the Presto community, and it will be available in an upcoming PrestoDB release.
  • Pre-tuned Presto cluster capability – New clusters now come pre-tuned with even more advanced features enabled, including the Cost-based Optimizer and Dynamic Filtering Capabilities which can improve performance, particularly for complex queries, as well as memory tuning to enable best out-of-the-box performance of Presto.

Ease of use – A range of major improvements have been added that make Ahana Cloud the easiest service for Presto on AWS. 

  • Automated and versioned upgrades of Ahana Compute Plane – Versioning and upgrade infrastructure brings users disruption-free upgrades – new features are automatically made available to users. 
  • Ability to deploy different instance types for Presto coordinator and workers, as well as Ahana-managed Hive Metastore Catalog – This brings users flexibility to use Presto for a range of workloads and customize resources provisioned as needed. 
  • Customized IP block allocation for the Ahana-managed Virtual Private Cloud – Allows users with more restrictive networking environments to use Presto in the cloud
  • Support for asynchronous query execution with Apache Superset in the Ahana Compute Plane – Ahana manages Apache Superset so that administrators can query and validate that clusters are available for their data teams. Now they can run long-running queries with Apache Superset as well. Ahana manages and automates the deployment of the infrastructure required for these queries to run asynchronously, reducing the time platform teams spend on infrastructure.
  • One click creation of AWS provisioning roles and policies needed with an AWS CloudFormation Template – This dramatically simplifies setting up the initial cross-account access between Ahana and the user’s AWS account. 

Security

  • Significantly tighter policies and advanced AWS tagging for all Ahana-managed resources in the user’s account – Ahana’s managed service is deployed in the user’s AWS account, which enables compute to be deployed where the data is without the need for customer data to ever touch Ahana Cloud. With the latest improvements, the policies needed for Ahana management of the compute plane are even tighter and each resource managed by Ahana on behalf of the user is explicitly tagged. This makes the Ahana environment completely isolated and independent from any other user-managed resources.
  • Integration of Ahana Compute plane with AWS CloudTrail, including log validation – AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. Now with Ahana’s integration with CloudTrail, you can log, continuously monitor, and retain Ahana-administered account activity and actions across your AWS infrastructure. 

Other new upcoming capabilities in Ahana Cloud for Presto, include:

  • Advanced security with the Apache Ranger integration – You can now use your centralized security administration for fine-grained access control to your Presto clusters. This capability has also been contributed to Presto open source, making it the first open integration for Presto and Apache Ranger. 
  • Better cost management for Presto clusters – Your Presto cluster automatically reduces the number of worker nodes to a single node when the query engine is idle for a certain time. In addition, Ahana can scale clusters up and down based on CPU utilization.
  • Programmatic access to cluster management – Create, start, stop and resize your Presto clusters using our REST API. 

“The convergence of the data warehouse and data lake continues to be a growing trend, and Presto is one of the few technologies that enables ad hoc analytics for a unified analytics approach,” said John Santaferraro, Research Director, Enterprise Management Associates. “Based on interviews, EMA sees the Presto open-source community as vibrant and actively accelerating the value of open source software. Ahana Cloud makes Presto accessible to data platform teams of all sizes so they can use SQL to analyze all of their different data stores.”

Together, these new capabilities help to solidify Ahana Cloud’s position as the only provider of Presto in the cloud and give platform teams of all sizes more capabilities to easily deploy Presto at scale in the cloud.

Supporting Resources:

Tweet this: @AhanaIO announces a set of new capabilities for its Ahana Cloud for Presto managed service #Presto #PrestoDB #Cloud #Data #OpenSource #Analytics https://bit.ly/3tAqV7S

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

PrestoCon Day Showcases Momentum of Presto Community, Features Industry Leaders Sharing Use Cases and Best Practices

Popular session topics include the Presto and Apache Iceberg Connector by Twitter and the Presto, Today & Beyond panel with Ahana, Uber and Facebook

San Mateo, Calif. – March 17, 2021 Ahana, the self-service analytics company for Presto, announced today their participation in PrestoCon Day, a day dedicated to all things PrestoDB taking place on Wednesday, March 24th. 

Following last year’s massively successful PrestoCon that had hundreds of global attendees and over 20 sessions, PrestoCon Day is a day-long event for the PrestoDB community by the PrestoDB community. In addition to being a premier sponsor of the event, Ahana will be participating in four sessions; two Ahana customers, Carbon and Cartona, will also be presenting their Presto use cases.

“PrestoCon Day will feature a lot of the work that core Presto engineers have been building over the past six months, and we’re looking forward to sharing that with the community,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana, Program Chair of PrestoCon and Chair of the Presto Foundation Outreach Committee. “In addition to the biggest internet companies like Facebook, Uber and Twitter speaking, we’ll also have modern day SaaS companies presenting how they’re leveraging the power of Presto on AWS with Ahana. We’re thrilled to host an event for the Presto community that helps members, new and seasoned, connect with one another.”

“Presto Foundation was created to accelerate and foster the adoption of Presto, and PrestoCon Day is an important event for the community to showcase its project momentum, usage and advancements,” said Chris Aniszczyk, Vice President, Developer Relations, The Linux Foundation. “We look forward to the conference and bringing the Presto community together for a day of collaboration and community-building.”

Ahana customers speaking at PrestoCon

  • How Carbon uses PrestoDB in the Cloud with Ahana to Power its Real-time Customer Dashboards by Jordan Hoggart, Data Engineer at Carbon
  • Presto on AWS using Ahana Cloud at Cartona by Omar Latif, Data Engineer at Cartona

Ahana sessions at PrestoCon

  • Simplifying Presto on AWS with Ahana Cloud Managed Service by Dipti Borkar, Cofounder & Chief Product Officer at Ahana
  • Using Presto’s BigQuery Connector for Better Performance and Ad-hoc Query in the Cloud by George Wang, Principal Software Engineer at Ahana and Roderick Yao, Product Manager at Google

Panels at PrestoCon 

  • The Presto Ecosystem with speakers from Preset, Apache Hudi, and Apache Pinot, moderated by Dipti Borkar, Cofounder & Chief Product Officer at Ahana
  • Presto, Today & Beyond featuring David Simmen, Cofounder and CTO at Ahana along with speakers from Uber and Facebook, moderated by Dipti Borkar, Cofounder & Chief Product Officer at Ahana

Additionally, industry leaders Facebook, Uber, Twitter and Alibaba will be sharing the latest innovation in the Presto project, including the Prism Presto Gateway Service (Uber), Petabyte Scale Log Analysis (Alibaba), Presto and Apache Iceberg (Twitter), and a Presto State of the Union (Facebook). 

View all the sessions in the full program schedule

PrestoCon Day is a free virtual event and registration is open. Additionally, Ahana is hosting a hands-on virtual lab the day prior to PrestoCon to help Presto users get up and running on AWS. 

Other Resources

PrestoCon Day registration

PrestoCon Day Program Schedule

Virtual Lab: Setting up your AWS S3 data lake and querying it with Presto

Tweet this: .@AhanaIO announces its participation in #PrestoCon Day #cloud #opensource #analytics #presto #prestodb BITLY

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Upsolver and Ahana Announce Technology Partnership to Accelerate Adoption of Open Data Lake Analytics in the Cloud with Presto

Todd Odess, Head of Partnerships at Upsolver | Ali LeClerc, Head of Marketing at Ahana

Today we’re thrilled to announce our joint technology partnership. Upsolver, the first no-code, high-speed cloud compute layer for making data lakes analytics-ready, and Ahana, the managed service for Presto, offer an end-to-end cloud-native solution for open data lake analytics. This solution gives anyone the ability to prepare data for analysis and querying. Upsolver’s visual UI lets users easily land and automatically optimize data in AWS S3, and then query it with Ahana Cloud for Presto.

Customers get to value quickly and affordably, without having to write and maintain Spark-based data pipelines or manage complex query engines. Over the long-term, they benefit from greater analytics agility, being able to productionize new use cases in days rather than weeks or months. 

Companies today collect and manage massive amounts of data – on AWS S3 alone there are 10,000+ data lakes. But studies show that companies only analyze 12% of their data; the rest is locked up in siloed systems, expensive on-premises solutions, and incompatible data formats. Open Data Lake Analytics helps users unlock that data. 

Open Data Lake Analytics is an approach to traditional data analytics that opens up a wider range of data for analytics in the cloud. It’s built on open source Presto, the distributed SQL query engine, and supports open formats (Apache Parquet, Apache ORC, and more) and open interfaces that integrate with existing SQL systems. It can also run on any cloud. 

Together, Ahana and Upsolver give organizations a cloud-native platform that analyzes large volumes of data at any speed – real-time or batch – automating the convoluted process of building cloud data lakes plus managing Presto. Working together enables data teams to manage complex pre-query data engineering and a distributed query engine without having deep technical knowledge. 

Upsolver provides a visual, SQL-based interface to easily create and manage tables, as well as deep integration with AWS S3, AWS Glue Catalog, and PrestoDB for high performance queries. 

We believe this partnership that provides open data lake analytics is critical to maximize the value of our customers’ data lakes. Together we accelerate time to value on data lake analytics projects and make it simple enough for any data practitioner to analyze any dataset.

This joint solution brings the power of technologies used at internet-scale companies like Facebook, Uber, and Twitter to data teams of any size. It helps companies expand beyond the cloud data warehouse to a multifaceted analytical stack, opening up a broad set of advanced use cases and greater flexibility.

We look forward to this new partnership with Upsolver and making the open cloud data lake accessible for everyone.

Additionally, Ahana and Upsolver are both part of the AWS Partner Network and premier members of the Linux Foundation’s Presto Foundation, a foundation that governs the community-driven PrestoDB project.

Ahana and Upsolver will be holding a free virtual hands-on workshop on March 23 to help users get up and running with the platform. Registration is now open.

The joint solution is available today – users can select Ahana as an output for Upsolver.

Send us a note if you’re interested in learning more! Looking forward to seeing all you do with Upsolver and Ahana 😀

How do I get deterministic performance out of Amazon Athena?

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. This approach is advantageous for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse.

What is Athena great for?

Users love Athena for its simplicity, ease of getting started, and the fact that no servers need to be managed. You only pay for the amount of data scanned. Athena is closely integrated with AWS Glue, so if you are already using the Glue catalog for your ETL workloads, all your tables are already defined and accessible. You can then make these tables available for interactive analytics to your extended team.

What are the trade-offs?

The simplicity of the deployment architecture of Amazon Athena, however, comes at a price of inconsistent performance with scale, as many users of Athena have already experienced.

There are two primary trade-offs with Athena. Firstly, transparency – “you get what you get” – you have no visibility into the underlying infrastructure serving your queries, nor do you have the ability to control or tweak that infrastructure for your workloads. This works where performance and latency are not critical, but it can be a non-starter for users who need control. Secondly, despite all their advantages, shared-service serverless models have disadvantages related to performance predictability.

AWS has documented best practices in their performance tuning guide to get the most out of Athena and to avoid typical errors that users encounter, such as “Query exceeded local memory limit” and “Query exhausted resources at this scale factor”. These best practices include using optimized formats such as Parquet and ORC, avoiding small files, and partitioning the data appropriately.
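As a rough illustration, a CREATE TABLE AS statement can rewrite raw CSV data into partitioned Parquet in a single step. This is a minimal sketch – the table, column, and bucket names are hypothetical, and the exact table properties can differ slightly between Athena and open source Presto:

-- rewrite a raw CSV table into partitioned Parquet to reduce data scanned per query
CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['event_date'],
  external_location = 's3://my-bucket/events-parquet/'
) AS
SELECT user_id, event_type, event_date   -- partition column goes last
FROM events_csv;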

How does the serverless architecture impact my query performance?

Athena at its core is a shared serverless service per region – to protect against usage spiraling out of control by a handful of customers, Amazon has placed restrictions on the usage, size, concurrency, and scale of the service on a per-customer basis and within the region overall.

These limits and restrictions are well documented in the service limits, such as 20 active DDL and 20 active DML queries in most regions, 30-minute maximum query timeouts, API limits, and throttling, among others (some of these limits can be raised by working with Amazon support). These limits are guard rails around the service so that the usage of one customer doesn’t adversely affect the experience of another. These guardrails are, however, far from perfect, since there are only a finite number of resources per region to be shared across customers in the region. Any excessive load due to seasonal or unexpected spikes at Amazon’s scale can easily consume the shared resources, causing contention and queuing.

In addition to the shared query infrastructure, Athena’s federated query connectors are based on Lambda, which is again serverless. Lambda scales out well and can be performant once warmed up; however, consistent performance comes only with consistent use. Depending upon the usage of a particular connector in the region and the available capacity of the backend infrastructure, you could run into latencies caused by cold starts, especially if you are using connectors that are not accessed frequently, e.g. custom connectors.

If a large number of users end up using Athena at the same time, especially for large-scale queries, they often observe extended queuing of their queries. Though the eventual query execution might not take long once resources are available, the queuing significantly impacts the end-user experience for interactive query workloads. Users have also at times reported inconsistent execution times for the same query from one run to another, which ties back to the shortcomings of the shared-resources model.

So can I get deterministic performance out of Athena? 

If your queries are not complex or latency-sensitive and your infrastructure is in a less crowded region, you may not encounter performance predictability issues frequently. However, your mileage depends entirely upon several factors, such as when you are running the query, which region you are running it in, the volume of data you are accessing, and your account service-limit configurations, to name just a few.

What are my options?

If your interactive query workload is latency-sensitive and you want to deterministically control the performance of your queries and the experience of your end users, you need dedicated managed infrastructure. A managed Presto service gives you the best of both worlds – it abstracts the complexity of managing a distributed query service while giving you the knobs to tune the service to your workload needs.

Ahana provides a managed Presto service that can scale up and down depending on your performance needs. You can segregate workloads into different clusters or choose to share the cluster. You can also choose beefier infrastructure for more business and time-critical workloads and also set up separate clusters for less critical needs. You make that choice of price, performance, and flexibility depending upon business objectives.

Do I have to use AWS Lambda to connect to data sources with Athena?

The Athena Federated Query Journey

AWS announced the public preview of Athena federated query in November 2019 and moved the service to General Availability (GA) in November 2020. As of this writing, Athena supports two engine versions – engine 1 based on Presto 0.172, and engine 2 based on Presto 0.217. Among other features, the federated query functionality is only supported on engine 2 of Athena.

What connectors are supported?

Athena currently supports a number of data source connectors:

  • Amazon Athena CloudWatch Connector
  • Amazon Athena CloudWatch Metrics Connector
  • Athena AWS CMDB Connector
  • Amazon Athena DocumentDB Connector
  • Amazon Athena DynamoDB Connector
  • Amazon Athena Elasticsearch Connector
  • Amazon Athena HBase Connector
  • Amazon Athena Connector for JDBC-Compliant Data Sources (PostgreSQL, MySQL, and Amazon Redshift)
  • Amazon Athena Neptune Connector
  • Amazon Athena Redis Connector
  • Amazon Athena Timestream Connector
  • Amazon Athena TPC Benchmark DS (TPC-DS) Connector

In line with the serverless model of AWS Athena, and in order to maintain an impedance match between Athena and the connectors, the connectors are also serverless, based on AWS Lambda, and run within the region. These connectors are open-sourced under the Apache 2.0 license and are available on GitHub, or can be accessed from the AWS Serverless Application Repository. Third-party connectors are also available in the Serverless Application Repository.

Athena is based on Presto – can I use the open-source Presto data source connectors with Athena?

Presto has an impressive set of connectors right out of the box; these connectors, however, cannot be used as-is with Athena. The Presto service provider interface (SPI) required by the Presto connectors is different from AWS Athena’s Lambda-based implementation, which is based on the Athena Query Federation SDK.

You can use the Athena Query Federation SDK to write your own connector using Lambda or to customize one of the prebuilt connectors that Amazon Athena provides and maintains. The Athena connectors use the Apache Arrow format for returning data requested in the query. An example Athena connector can be found here.

If you’re looking for a managed service approach for PrestoDB, Ahana Cloud offers that based on open source Presto. Ahana’s managed deployment is 100% compatible with the open source Presto connectors, and you can get started with a click of a button.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

How do I do geospatial queries and spatial joins in Presto?

A question that often comes up is “how do I do geospatial queries and spatial joins in Presto?”. Fortunately Presto supports a wealth of functions and geospatial-specific joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set up some test data in two tables. The first table is trips_table, which stores four rows, each representing a simple taxi trip. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (
trip_id int, 
orig_long double, 
orig_lat double, 
dest_long double, 
dest_lat double, 
trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast('2021-03-02 09:00:00 UTC' as timestamp));
insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast('2021-03-02 09:30:00 UTC' as timestamp));
insert into trips_table values (3, 48.858965, 2.293497,48.859952, 2.340328, cast('2021-03-02 09:45:00 UTC' as timestamp));
insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast('2021-03-02 10:45:00 UTC' as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table with each row storing the shape representing an area e.g. central London, and a name for the shape. We represent the shape with a sequence of coordinates that enclose a specific area: 

create table city_table (
geo_shape varchar, 
name varchar);

insert into city_table values ('POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))', 'London, central');
insert into city_table values('POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))', 'Los Angeles, downtown');

For information:

  • I used a simple triangle to represent downtown LA (see illustration below)
  • I used a rectangle representing central London. 
  • In each case the first pair of coordinates for the shape is the same as the last pair – so it’s an enclosed bounding box or polygon we’re describing.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coords using the POLYGON(()) well-known text (WKT) syntax. The double brackets are required. 
Simple polygon to represent downtown LA 
(credit: “Geospatial Data Analysis using SQL”)

Now let’s run a query to count how many trips occurred in each city. We join our two tables, and we use each journey’s originating long/lat coordinates to determine – using ST_CONTAINS() – if that point exists in any of our shapes.  This function requires the polygon to be expressed as a special type – Geometry – so we convert our shape from text using ST_GeometryFromText() function:

SELECT c.name as City, count(*) as Trips 
FROM trips_table as t 
JOIN city_table as c 
ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat)) 
GROUP BY 1;

         City          | Trips 
-----------------------+-------
 Los Angeles, downtown |     1 
 London, central       |     2 
(2 rows)

We see both London trips made it into the result set, despite one of the trips ending at the airport, which is way outside the shape we defined for central London – this is because the query uses the originating coordinates for each trip, not the destination coordinates. 

Also notice the Paris trip didn’t make it into the result – this is because we did not define a shape for Paris.
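If you wanted to count trips by destination city instead, the same join works against the destination coordinates – a minimal variation of the query above using the tables we already created:

SELECT c.name as City, count(*) as Trips 
FROM trips_table as t 
JOIN city_table as c 
ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.dest_long, t.dest_lat)) 
GROUP BY 1;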

In this example you’ve seen some of Presto’s geospatial capabilities in action – all of which are available out of the box with Ahana Cloud for Presto.

Presto’s Geospatial functions are listed in the Presto documentation.  

We have a ton of resources to help you get started with Presto, check them out here.

How do I query JSON documents with Presto?

JSON documents are a common data type. A lot of people collect logs and load them into S3. Querying that JSON with Presto can be challenging because you’re not always sure what the schema is – since it’s JSON, the schema can vary from document to document. The great part about Presto is that it has functionality that enables you to get insights on your JSON data anyway.

Presto has a JSON data type. The JSON data type is similar to a string. One way to think about it is that it is just a string containing JSON, with the exception that the data inside the string has already been parsed – it’s already well-formed JSON.

But it’s also slightly semantically different, in that JSON has maps, and when you have a string with a map, the keys of the map can be ordered differently. In the JSON data type, the keys in a map are always kept in a consistent order. It’s worth noting that two maps with the same contents will compare as equal even if their keys appeared in different orders in the input. This can be important when doing an aggregation, regrouping, or a comparison: logically the two maps are treated as the same value, no matter what the original order was.

How the JSON parse function works

The JSON parse function takes a JSON string and returns a JSON data type. Here is an example of the JSON data type:

SELECT json_parse('null'); -- JSON 'null'
SELECT json_parse('true'); -- JSON 'true'
SELECT json_parse('42'); -- JSON '42'
SELECT json_parse('"abc"'); -- JSON '"abc"'
SELECT json_parse('[1, 2, 3]'); -- JSON '[1,2,3]'
SELECT json_parse('{"a": 1, "b": 2}'); -- JSON '{"a":1,"b":2}'
 
SELECT json_parse('["hey", 42, {"xyz": 100, "abc": false}]'); 

This example has an array that contains a string, a number, and then a map. We can see that the map has two keys, xyz and abc; in the input, xyz comes before abc, so the keys are not in sorted order.

After it’s parsed with json_parse, the keys in the map are ordered. This is why two documents that differ only in key order end up as the same JSON data type value and compare as equal: when two maps with differently ordered keys are parsed, they have the same contents.

In Presto, all the JSON functions take the JSON data type. And if you pass a string into a JSON function, it will be implicitly parsed into JSON.
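For example, json_extract_scalar works either way. A minimal sketch with a made-up document (the keys and values here are just for illustration):

-- a varchar argument is implicitly parsed into JSON first
SELECT json_extract_scalar('{"name": "presto", "port": 8080}', '$.name');               -- 'presto'
-- or you can pass an already-parsed JSON value
SELECT json_extract_scalar(json_parse('{"name": "presto", "port": 8080}'), '$.port');   -- '8080'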

Another question that comes up is: what is the difference between NULL in JSON and NULL in SQL – a JSON null and a SQL NULL?

This can be confusing because JSON has a null and so does SQL, and they are two different things. A SQL NULL is a special value: operations on a NULL produce a NULL, much as they do in other query languages. A JSON null, by contrast, is just another value, like a number, a string, or an array. They’re totally different, which has implications when writing Boolean expressions.
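One way to see the difference is with json_extract – a small sketch, using a made-up document with a key whose value is JSON null, that you can verify on your own cluster:

-- the key 'a' exists and its value is JSON null, so a JSON 'null' value is returned
SELECT json_extract(json_parse('{"a": null}'), '$.a');
-- the key 'b' does not exist, so a SQL NULL is returned
SELECT json_extract(json_parse('{"a": null}'), '$.b');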

We have a lot more resources on Presto if you’d like to learn more, check out our Getting Started with Presto page.

Is there latency overhead for Presto queries if everything fits into memory and doesn’t need to be distributed?

Presto is both in-memory and distributed, so each worker has memory and uses it. However, Presto is not an in-memory database. Presto is the query engine and reads from the storage underneath. This is where Presto’s caching capabilities become important.

If all the data required to satisfy a given query arrives from the connected data sources and fits within the memory of the Presto cluster, the query executes entirely in memory. Beyond that, Presto’s MPP, in-memory pipelining architecture will accelerate workloads further in two ways:

1) Presto will ensure data is divided across all the workers to bring maximum processing power and memory capacity to bear.

2) Presto’s generated query plan will ensure execution steps are distributed amongst all the workers such that processing takes place in parallel on the data across all workers simultaneously, as efficiently as possible. Furthermore, with its caching capabilities, Presto can accelerate query execution even further for specific workload patterns, further reducing latency.

If you want to get started with Presto, check out our docs to learn more about the Ahana Cloud managed service for Presto.

Is the Hive metastore a hard dependency of Presto, or could Presto be configured to use something else like Postgres?

With Presto, there’s no hard dependency on the Hive metastore – Presto is catalog-agnostic. However, there are significant advantages, like better performance, when you use the Hive metastore.

A managed service like Ahana Cloud for Presto provides a managed version of the Hive metastore catalog, so you don’t have to deal with the complexity of managing your own metastore. Ahana also supports AWS Glue as a data catalog service.

You don’t necessarily have to use a metastore if you are connecting via the provided connectors such as MySQL, Postgres, Redshift, Elasticsearch, etc. But if you intend to query data on S3, for instance, you will need a metastore (i.e. Hive or Glue) to define external tables and their formats.

Presto has many connectors including Postgres, which is a supported data source.

What are the differences between Presto and Apache Drill?

Drill is an open source SQL query engine whose design traces back to the 2010 Google paper “Dremel: Interactive Analysis of Web-Scale Datasets”. Development of Apache Drill began in 2012.

Performance & Concurrency
Drill has never enjoyed wide adoption partially because it was tied to one Hadoop distribution (MapR) but mainly because of inherent performance and concurrency limitations. There are companies that built products based on Drill who report these performance and concurrency issues, and many have migrated away from Drill as a result. Presto’s popularity and adoption, on the other hand, has exploded as numerous companies from SME to web-scale have deployed Presto and contribute to the project. These include Facebook, Uber, Alibaba, Twitter, Netflix, AirBnB and LinkedIn. Users value Presto’s vendor-neutrality and rate of innovation.

Metastores
Presto connects to external metastores (AWS Glue, Hive Metastore Catalog); many users deploy Presto + AWS Glue/Hive for their data lake analytics. In addition, schema for relational data sources can be obtained directly from each connector. On the other hand, Drill performs its own schema discovery and does not need a metastore such as Hive (but can use Hive if needed), which is a benefit.

Community
Compared to Presto, the Apache Drill community has dwindled somewhat especially when it comes to adoption and contributions. We talk to many users who are looking to move from Apache Drill to Presto. If we look at DB engines – https://db-engines.com/en/ranking/relational+dbms – we see that Presto has continued on a positive upward trend and ranks #27, as opposed to Drill which is at #49 in a downward trend.

Overall, Drill is in decline. This was accelerated when HPE (Hewlett Packard Enterprise, who acquired MapR) announced they will no longer support or contribute to Drill (a drawback of having a query engine tied to an ecosystem such as Hadoop). Presto’s popularity continues to increase, as illustrated by 1) the number of commits to the project (In the past month 46 authors have pushed 79 commits to master and 88 commits to all branches, and on master 568 files have changed and there have been 10,930 additions – as of Feb 26 2021), and 2) Presto’s continued rise on DB-engines.

Drill is a top-level Apache Foundation project but does not have a strong and active community behind it. Presto is backed by a strong community and is overseen by the Presto Foundation which is part of the Linux Foundation with 8 premier organizations driving it including Facebook, Uber, Twitter, Intel and Ahana.

Want to learn more about PrestoDB? Check out our free tutorials to help you get started.

Why am I getting zero records when I use AWS Athena to query a CSV file?

There’s a common error many AWS Athena users see when they query CSV files – they will get zero records back. This tends to happen when they run the AWS Glue crawler and create a new table on the CSV file. There are several reasons why the query might return no records.

  1. The crawler is pointing to a file instead of an AWS S3 bucket
  2. The LOCATION path is wrong or there’s a double slash in it
  3. The partitions haven’t loaded into the Glue Data Catalog or the internal Athena data catalog
  4. In general, CSV crawlers can be sensitive to different issues (e.g. embedded newlines, partially quoted files, blanks in integer fields), so make sure everything is accurate

On the first point, if you have selected a file instead of your S3 bucket, the crawler will succeed but you won’t be able to query the contents which is why you’ll see the ‘Zero Records Returned’ error message. If there are other files that you don’t want crawled, you’ll need to create a new folder and move your CSV to that new folder, and then update the include path accordingly (and you’ll need to re-crawl it).

In order to get your S3 data lake to work, you’ll need to make sure each batch of same-schema files has its own top level directory.

If you’re running into issues like this with AWS Athena and want to offload the management while still getting the power of Presto, check out Ahana Cloud. It’s a fully managed service for AWS that removes these types of complexities and makes it really easy to run Presto in the cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Does Presto work natively with GraphQL?

Some users may have a primary data store that is GraphQL-based (AWS AppSync) and want to leverage Presto. For context, GraphQL sits in the application layer as part of the service mesh infrastructure. It is an abstraction API for mostly operational sources.

Typically for analytics, users move their data into an analytical stack, because the properties and queries of an operational stack and analytical stack are quite different.

Users with this kind of environment typically move the data to S3 to run data lake analytics. Presto is the de facto engine for data lake analytics, and in addition it also allows running queries across S3 and operational systems like RDS.

Running analytics on GraphQL / AppSync is not recommended because:

  1. GraphQL isn’t really a query engine with an optimizer, etc.
  2. Analytical queries are fairly intensive from a compute perspective and need the right analytical engine to process and execute them. GraphQL is mostly a pass-through, so this will affect your operational systems.

Want to learn more about Presto? Download our free whitepaper: What is Presto?

What is a Presto Connection and how does it work?

Connections to and from Presto fall into two main categories:  

1) The Presto connectors that provide access to source data in various databases and file systems. The connectors are listed here: https://prestodb.io/docs/current/connector.html. One of the most commonly used Presto connectors is Hive which gives query access to HDFS and S3 data stored in a variety of formats.

2) Presto’s interfaces that allow Presto clusters to be accessed by tools and applications. There are four primary interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. The download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection to a Presto cluster:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries  

We hope the above information helps you with creating Presto Connections.

Why does a single AWS Athena query get stuck in QUEUED state before being executed?

One of the drawbacks of AWS Athena is the fact that as a user, you don’t have control over query performance and predictability. One specific issue that comes up quite a bit for Athena users is single queries getting stuck in a QUEUED state before being executed. This happens because of Athena’s serverless nature in combination with its multi-tenant service – you are essentially competing for resources when you run a query.

Many users will get the message “Query waiting in queue” somewhat randomly. For this reason, it’s recommended not to use Athena for user-facing requests because of its unpredictability, and many users have decided they can’t rely on Athena at all based on this issue. Because you can’t predict demand or manage the multi-tenancy, it’s not ideal for ad-hoc analytics (which is a core use case for PrestoDB).

For users who still want to leverage PrestoDB for ad-hoc analytics and data lake analytics, Ahana Cloud is a solution that allows you to run Presto transparently as a managed service. You won’t run into the issues around unpredictability and queries getting stuck. There’s a 14 day free trial if you’re interested in checking it out.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Presto Data Share

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto enables data sharing in two ways. 

1) With its broad list of data connectors, Presto provides a simple way to exploit data virtualization with federated SQL queries accessing data across multiple data sources. Data in RDBMSs, NoSQL databases, legacy data warehouses, and files in object storage can all be accessed, and combined in a single SQL query if required. Presto allows querying data where it lives, whether in Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources (as sketched below), allowing for analytics across your entire organization by a variety of users and applications. A complete list of Presto’s connectors can be found at https://prestodb.io/docs/current/connector.html 
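As a simple illustration, a single federated query might join operational data in a relational database with historical data in S3. This is a hedged sketch – the catalog, schema, and table names are hypothetical and depend entirely on how your connectors are configured:

-- join a table from a Postgres catalog with a table from a Hive/S3 catalog in one query
SELECT o.order_id, o.order_total, c.customer_segment
FROM postgresql.public.orders o
JOIN hive.analytics.customer_segments c
  ON o.customer_id = c.customer_id;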

2) Sharing access to data by connecting your Presto cluster to other systems, tools, and applications is achieved using the Presto API, or using SQL via client libraries, ODBC, or JDBC interfaces:

API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using the syntax:

http://<prestoServerHost>:<port>/v1/query 

The default port is 8080.

You can optionally specify a query ID – in this example the query ID is 20200926_204458_00000_68x9u:

http://myHost:8080/v1/query/20200926_204458_00000_68x9u

JDBC

Presto can be accessed using SQL from Java using the JDBC driver. Download link is in the documentation: https://prestodb.io/docs/current/installation/jdbc.html. The following JDBC URL connection string formats are supported:

jdbc:presto://host:port

jdbc:presto://host:port/catalog

jdbc:presto://host:port/catalog/schema

Here’s example Java code to establish a connection:

String sql = "SELECT * FROM sys.node";
String url = "jdbc:presto://localhost:8080/catalog/schema";
try (Connection connection =
        DriverManager.getConnection(url, "test", null)) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("node_id"));
            }
        }
    }
}

ODBC

Several free and paid-for ODBC driver options exist.

Client libraries

Presto libraries for C, Go, Java, node.js, PHP, Python, R, Ruby are available at https://prestodb.io/resources.html#libraries 

When it comes to using Presto to share data, we hope the above information is useful.

Presto Performance

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs). Performance is particularly important for ad-hoc analytical queries, so Presto has a number of design features to maximize speed, such as in-memory pipelined execution, a distributed scale-out architecture, and a massively parallel processing (MPP) design. In terms of specific performance features, Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • Presto uses a cost-based optimizer, which as you would often expect depends on collected table statistics for optimal functioning. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. There are many different types of storage that can be queried by Presto, some faster than others, so choosing the fastest available data source will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both; switching from a data source that is busy serving lots of users (and therefore suffering high contention) to an alternative, perhaps a read replica of a database; creating indexes; or accessing a pre-aggregated version of the data. It may even mean moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s supported file formats that feature performance optimizations, like ORC or Parquet, and enabling compression.

Presto Performance Tips

The following recommendations can help you achieve maximum performance from your Presto clusters:  

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s Java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto. A useful rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS (https://prestodb.io/docs/current/sql/create-table-as.html) by adding the partitioned_by clause to the CREATE TABLE (see the sketch after this list).
  • Use bucketing.  Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering, which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible. Use the SQL ANALYZE <tablename> command to do this. Repeat the ANALYZE command for all tables involved in queries on a regular basis, typically when the data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore. For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'full'); Do this right after you create the table, and repeat it when new partitions are added.
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET SESSION join_distribution_type='AUTOMATIC';  
    • SET SESSION join_reordering_strategy='AUTOMATIC'; 

(Both SET statements are needed; the settings persist until the session ends or you log out.)  

  • You should enable Dynamic Filtering when one or more joins are in play, especially when a smaller dimension table is used to probe a larger fact table. Dynamic Filtering is pushed down to the ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve the performance of hash JOINs. Enable it with:
    • SET SESSION enable_dynamic_filtering=true;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O. Picking the right kind of instance for worker nodes is important. Most analytical workloads are I/O intensive, so the amount of network I/O available can be a limiting factor and overall throughput will dictate query performance. Consider choosing higher network I/O instances for the workers – on AWS, for example, you can compare each instance type’s “network performance” rating.

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 
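As referenced in the partitioning and bucketing tips above, here is a minimal CTAS sketch that combines both. The catalog, schema, table and column names are hypothetical, and note that with the Hive connector the partition column(s) must come last in the SELECT:

CREATE TABLE hive.sales.orders_part
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['order_date'],
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 32
) AS
SELECT order_id, customer_id, amount, order_date
FROM hive.sales.orders_raw;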


Finally, to assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them.  This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service. 

We hope these Presto performance recommendations help you get the most out of Presto. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto REST API

Presto’s HTTP API is the communication protocol between server and client. It’s used to send query statements for execution on the server and to receive results back to the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes. 

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

 http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID/password. Stats for the simple test query – select now() – can be seen:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]

Presto Parquet

Parquet is a columnar storage format for Hadoop, supported by Presto. Storing data as columns, as opposed to rows, enables Presto to answer queries more efficiently: by reading only the columns a query needs and avoiding unnecessary scans, columnar storage saves disk space and improves query performance for larger data sets.

In Parquet, data is first horizontally partitioned into groups of rows; within each group, data is then vertically partitioned into columns. Data for a particular column is stored together, compressed and encoded to save space and improve performance. Each Parquet file has a footer that stores the codecs, encoding information, and column-level statistics, e.g. the minimum and maximum values for each column. 

There have been several improvements made to Presto’s Parquet reader by the community, most notably by Uber, to enhance performance with features such as pushdown.  
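Putting the two paragraphs above together, a selective query benefits from both column pruning and the footer statistics. In the hedged sketch below (hypothetical table and columns), only two columns are read, and row groups whose min/max values fall outside the filter can be skipped, provided the data is reasonably sorted or clustered on that column:

-- Only trip_id and fare are read; footer min/max stats on pickup_date
-- allow whole row groups to be skipped for this filter
SELECT trip_id, fare
FROM hive.analytics.trips_parquet
WHERE pickup_date = DATE '2021-01-15';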

For developers, a useful tool-set for working with Parquet files is parquet-tools, which allows you to see the schema contained in a file, view data, examine data types, etc. Parquet-tools can be installed via pip (https://pypi.org/project/parquet-tools/), installed on a Mac with brew install parquet-tools, or cloned and built from https://github.com/apache/parquet-mr

Presto’s Parquet support is a popular choice for data storage.

Presto Speed

Presto is an open source distributed ANSI SQL query engine for analytics. Presto supports the separation of compute and storage (i.e. it queries data that is stored externally – for example in Amazon S3 or in RDBMSs).  Efficiency and speed are important for query performance so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design. In terms of specific performance features Presto supports:

  • Compression  (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning 
  • Table statistics – collected by the ANALYZE command and stored in a Hive or Glue metastore – give Presto’s query planner insights into the shape, size and type of data being queried, and whether or not the data source supports pushdown operations like filters and aggregates. 
  • A cost-based optimizer, which depends on collected table statistics to choose the most efficient query plan. 

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. Many different types of storage can be queried by Presto, some faster than others, so choosing the fastest available data source will boost Presto’s speed. That may involve tuning the source to reduce latency, increase throughput, or both; switching from a data source that is busy serving many users (and therefore suffering high contention) to an alternative, perhaps a read replica of a database; creating indexes; accessing a pre-aggregated version of the data; or moving portions of frequently used data from object storage to a faster storage layer like an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s supported file formats that feature performance optimizations, such as ORC or Parquet, and enabling compression. 


To assist with speed tests Presto has TPC-DS and TPC-H catalogs built-in to generate data for benchmarking purposes at varying scales.  For example the 1TB TPC-H dataset consists of approximately 8.66 billion records, in 8 tables. 

Need more Presto speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them. This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service.

Presto Tools

Several tools and libraries are available; some are bundled with core Presto and others are downloadable extras. See the documentation Resources page at https://prestodb.io/resources.html for details. 

Core Presto Tools:

  • Presto Console – Included with Presto, this browser-based UI allows administrators to monitor many aspects of their Presto deployment in real time, such as query workload, memory usage, execution plans etc. 

Community Contributed Tools 

Several community tools are listed at https://prestodb.io/resources.html including:

Other tools & Libraries 

Presto Documentation

Presto’s documentation is maintained by the Presto Foundation and the current version is available here: https://prestodb.io/docs/current/ 

Releases can be found at: https://github.com/prestodb and the Release Notes are at: https://prestodb.io/docs/current/release.html 

Currently the documentation site has no search function. Use Google to search the docs instead – for example if you want to find the Presto documentation page for window functions just google site:prestodb.io window functions and click the first match.

Some of the most popular and useful documentation pages are listed below for convenience: 

In addition the documentation for Ahana Cloud for Presto – a fully integrated, cloud-native managed service built for AWS – can be found at https://docs.ahana.cloud/docs

That’s your summary of Presto documentation. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Query Example

Since Presto is an ANSI SQL query engine, its SQL will be very familiar to practically anyone who has used a database, despite the fact Presto is not technically a database since it stores no data itself.

Starting with the SELECT statement, the full range of keywords that you would expect to use are supported, including DISTINCT, FROM, WHERE, GROUP BY, HAVING, UNION | INTERSECT | EXCEPT, ORDER BY, LIMIT and so on. More details can be found on the Presto SELECT doc page: https://prestodb.io/docs/current/sql/select.html 
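As a minimal sketch pulling several of those keywords together (the orders table and its columns are hypothetical):

-- Top 10 customers by spend since the start of 2021
SELECT customer_id,
       COUNT(DISTINCT order_id) AS orders,
       SUM(amount)              AS total_spend
FROM   orders
WHERE  order_date >= DATE '2021-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000
ORDER BY total_spend DESC
LIMIT 10;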

Like most other database products, CTEs are also supported – Common Table Expressions allow you to define a temporary named result set that is available in the execution scope of a statement such as SELECT or INSERT.  Here’s the same query written without and then with a CTE:

--Without CTE
SELECT a, b
FROM (
  SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;
 
--WITH CTE
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Here’s a CTE example query:

presto> WITH my_time_data (the_time_now) as ( select now() ) 
SELECT the_time_now 
FROM my_time_data;
 
             the_time_now              
---------------------------------------
 2021-01-18 12:23:12.167 Europe/London 
(1 row)

The above Presto query example and documentation link should get you up and running writing Presto queries. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

Presto Best Practices

We’ve put together a list of best practices for Presto, the open source distributed SQL engine, based on our experience. We hope you find these helpful.

  • Configure Presto’s coordinator and workers to run on separate instances/servers in production deployments. It is only recommended to have the coordinator and worker share the same instance for very small scale dev/test use. 
  • Always adjust Presto’s java memory configuration according to the available memory resources of your instance. There is no “default”, so the etc/jvm.config file on each node needs to be configured before you start Presto. A useful rule of thumb is: in each node’s jvm.config set -Xmx to 80% of the available physical memory initially, then adjust later based on your monitoring of the workloads. 
  • If using HDFS or S3 storage for example, consider using ORC format for your data files. There are numerous optimisations in Presto for ORC such as columnar reads (Presto is a columnar query engine), predicate pushdown, and  skipping reads of portions of files if the data is sorted (or mostly sorted) on the filtering columns.
  • Use partitioning. You can create a partitioned version of a table with a CTAS https://prestodb.io/docs/current/sql/create-table-as.html by adding the partitioned_by clause to the CREATE TABLE.
  • Use bucketing. Do this by adding the bucketed_by clause to your CREATE TABLE statement. You will also need to specify bucket_count. 
  • If you have a choice of metastore, choose Hive instead of Glue for your Presto cluster(s). Hive has some features that Glue does not, like column-level statistics and dynamic filtering, which can boost query performance. The final decision will depend on your particular mix of systems and feature requirements.

  • Collect table statistics to ensure the most efficient query plan is produced, which means queries run as fast as possible. Use the SQL ANALYZE <tablename> command to do this. Repeat the ANALYZE command for all tables involved in queries on a regular basis, typically when the data has substantially changed (e.g. new data has arrived / after an ETL cycle has completed).
  • In conjunction with the above, if you are exploiting partitioning, make sure you update the partitioning information that’s stored in your metastore. For example, to sync the metastore with the partitioning in the table default.customer use CALL system.sync_partition_metadata('default', 'customer', 'full'); Do this right after you create the table, and repeat it when new partitions are added (a sketch of both commands appears after this list).
  • Presto does not do automatic JOIN re-ordering by default. This only happens when the cost-based optimisation (CBO) feature is enabled (see below). By default you (or the application) need to make sure that smaller tables appear on the right side of the JOIN keyword. Remember: “LARGE LEFT” (put the large table on the left side of the join). 
  • If your queries are performing table joins you should try enabling the cost-based optimisation (CBO) feature – use: 
    • SET SESSION join_distribution_type='AUTOMATIC';  
    • SET SESSION join_reordering_strategy='AUTOMATIC'; 

(Both SET statements are needed; the settings persist until the session ends or you log out.)  

  • You should enable Dynamic Filtering when one or more joins are in play, especially when a smaller dimension table is used to probe a larger fact table. Dynamic Filtering is pushed down to the ORC and Parquet readers, and can accelerate queries on partitioned as well as non-partitioned tables. Dynamic Filtering is a join optimization intended to improve the performance of hash JOINs. Enable it with:
    • SET SESSION enable_dynamic_filtering=true;
  • If practical, try ordering/sorting your tables  during ingestion. This can greatly improve performance – especially the effectiveness of Dynamic Filtering. 
  • Monitor for Coordinator node overload. If your PrestoDB cluster has many (>50) workers then depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over http and you can appreciate how things could begin to slow down at scale. (Note Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s  single coordinator bottleneck).  In the meantime try increasing the resources available to the Coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  • Choose the right instances for your workers to ensure they have enough I/O. Picking the right kind of instance for worker nodes is important. Most analytical workloads are I/O intensive, so the amount of network I/O available can be a limiting factor and overall throughput will dictate query performance. Consider choosing higher network I/O instances for the workers – on AWS, for example, you can compare each instance type’s “network performance” rating.

It’s a good idea to monitor the IO activity of your worker nodes to determine if there’s an IO bottleneck. 

  • Consider enabling Resource Groups. This is Presto’s workload manager and it’s used to place limits on resource usage, and can enforce queueing policies on queries that run within them or divide their resources among sub-groups. A query belongs to a single resource group, and consumes resources from that group (and its ancestors). A resource group represents the available Presto resources packaged together, with limits related to CPU, memory, concurrency, queueing, and priority.  Except for the limit on queued queries, when a resource group runs out of a resource it does not cause running queries to fail; instead new queries become queued. A resource group may have sub-groups or may accept queries, but may not do both. More details in the documentation: https://prestodb.io/docs/current/admin/resource-groups.html 
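As referenced in the statistics and partition-sync tips above, here is a minimal sketch of both commands; the hive catalog and the default.customer table are purely illustrative:

-- Refresh table and column statistics after a substantial data change:
ANALYZE hive.default.customer;

-- Then sync the metastore's partition list with the partitions actually present in storage:
CALL hive.system.sync_partition_metadata('default', 'customer', 'full');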

We hope you find these Presto best practices useful. If you’re interested in getting started with Presto, check out the Ahana Cloud platform, a managed service for Presto in AWS.

How Does Presto Join Data?

Presto is a distributed system composed of a coordinator and workers; each worker can connect to one or more data sources through the corresponding connectors.

The coordinator receives the query from the client and optimises and plans the query execution, breaking it down into constituent parts, to produce the most efficient execution steps. The execution steps are sent to the workers which then use the connectors to submit tasks to the data sources. The tasks could be file reads, or SQL statements, and are optimised for the data source and the way in which the source organises its data, taking into account partitioning and indexing for example.

The data sources supported by Presto are numerous and can be an RDBMS, a noSQL DB, or Parquet/ORC files in an object store like S3 for example. The data sources execute the low level queries by scanning, performing filtering, partition pruning etc. and return the results back to the Presto workers. The join operation (and other processing) is performed by the workers on the received data, consolidated, and the joined result set is returned back to the coordinator.
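For example, a single federated query might join customer records held in MySQL with transaction data in an S3 data lake; the catalog, schema and table names below are hypothetical:

-- Join an operational database table with a data lake table in one query
SELECT c.name,
       SUM(t.amount) AS total_amount
FROM   mysql.crm.customers        AS c
JOIN   hive.datalake.transactions AS t
  ON   t.customer_id = c.id
GROUP BY c.name;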

You will notice Presto uses a “push model”, which is different, for example, from Hive’s “pull model”. Presto pushes execution steps to the data sources, so some processing happens at the source and some happens in Presto’s workers. The workers also communicate with each other, and the processing takes place in memory, which makes it very efficient and well suited to interactive queries. Hive, on the other hand, will read/pull a block of a data file, execute tasks, then wait for the next block, using the MapReduce framework. Hive’s approach is not suitable for interactive queries since it reads raw data from disk and stores intermediate data to disk, all via MapReduce, which is better suited to long-running batch processing. This diagram compares Hive and Presto’s execution approaches:

How Presto Joins Relational and Non-Relational Sources

The next diagram shows some of Presto’s core coordinator components, and the kinds of tasks Presto’s workers handle. In this simplified example there are two data sources being accessed: one worker is scanning a Hive data source, the other worker is scanning a MongoDB data source. Remember Presto does not use Hive’s MapReduce query engine or HQL – the diagram’s “hive” worker means it is using the Hive connector, with the metastore supplying the table metadata, while the raw source data is external to Presto, perhaps in HDFS in Parquet or ORC format. The worker dealing with the Mongo data is described as being on “the probe side” (in this example), whereby the Mongo data is read, processed, normalized into columnar form in Presto and then shuffled (or exchanged) across the cluster to the “builder” worker (the worker dealing with the Hive data here) for the actual join to take place.

This is a simplified example; in reality Presto is more sophisticated – the join operation could run in parallel across multiple workers, with a final stage running on one node (since it cannot be parallelized). This final stage is represented by the third worker at the top of the diagram, labeled “Output”. This output node’s task is to stream the result set back to the coordinator, and from there back to the client.

Joins – Summary

It’s easy to see how Presto is the “Polyglot Data Access Layer” since it doesn’t matter where your data lives, any query can access any data, in-place, without ETL or data shipping or duplication. It is true federation. Even when blending very different sources of data, like JSON data in elasticsearch or mongodb with tables in a MySQL RDBMS, Presto takes care of the flattening and processing to provide a complete, unified view of your data corpus.

If you want to try out Presto, take a look at Ahana Cloud. It provides a managed service for Presto in AWS.

When I run a query with AWS Athena, I get the error message ‘query exhausted resources on this scale factor’. Why?

AWS Athena is well documented as having performance issues, both in terms of unpredictability and speed. Many users have reported that even relatively lightweight queries on Athena will fail. Part of the issue may be how many columns the user has in the GROUP BY clause – even a small number of columns (fewer than 5) can run into this issue of not having enough resources to complete. Other times it may be due to how much data is being parsed, and again even small amounts of data (less than 200MB) can hit the same problem.

Presto keeps the GROUP BY columns in memory while it matches rows with the same grouping key. The more columns there are in the GROUP BY clause, the fewer rows are collapsed by the aggregation, so more groups must be held in memory. To address this problem, reduce the number of columns in the GROUP BY clause and retry the query.
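As a hedged illustration (hypothetical table and columns), the second query below holds far fewer distinct groups in memory than the first and is much less likely to exhaust resources:

-- Many grouping columns: many distinct groups held in memory
SELECT col_a, col_b, col_c, col_d, col_e, COUNT(*) AS n
FROM   events
GROUP BY col_a, col_b, col_c, col_d, col_e;

-- Fewer grouping columns: rows collapse into far fewer groups
SELECT col_a, col_b, COUNT(*) AS n
FROM   events
GROUP BY col_a, col_b;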

At still other times the issue may not be how long the query takes, but whether the query runs at all. Users who hit “internal errors” on queries one hour can re-run the same queries later and have them succeed.

Ultimately, AWS Athena is not predictable when it comes to query performance. That’s where Ahana Cloud, a managed service for Presto, can help. Ahana’s managed service for PrestoDB can help with some of the trade-offs associated with a serverless service.

Some of the reasons you might want to try a managed service if you’re running into performance issues with AWS Athena:

  • You get full control of your deployment, including the number of PrestoDB nodes in your deployment and the node instance types for optimum price/performance. 
  • Consistent performance because you have full control of the deployment
  • Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, increased resilience, speed, and ease of use. 
  • No limits on queries

Plus you can use your existing metastore, so you don’t need to modify your existing architecture.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

I use RDS Postgres databases and need some complex queries done which tend to slow down my databases for everyone else on the system. What do I need to consider as I add a data lake for the analytics?

Background

Many medium-sized companies start out using one of the six flavors of Amazon RDS: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. As they grow they can have numerous instances of RDS with numerous databases for each customer, whether internal or external. When one customer runs a large analytic query, that workload can cause problems for the RDS cluster, perhaps making it drop other workloads, fail, or slow down for everyone else. As the need to process huge amounts of data increases, so does the need to take your analytics to the next level.

In addition to your operational databases, the idea is to have a much more open analytics stack where you have the ability to run different kinds of processing on the same data. A modern analytics stack lets your organization have much more insights without impacting your operational side. And doing that with open data formats is another key consideration.

Considerations

There are a couple of options for evolving your analytics stack. One is to use a cloud data warehouse like Amazon Redshift or Snowflake. Another is to use open formats in a data lake with a modern SQL query engine. In the first case you get some advantages in raw performance on your data, but it comes at a cost, as well as a degree of lock-in, since you cannot easily get at data held in proprietary formats. Considering the data lake with a query engine, we believe Presto is one of the best choices because of its performance, scalability, and flexibility to connect to S3-based data lakes and federate other data sources as well.

So our recommendation would be to run Presto on top of data stored in an open Parquet or ORC format in S3. Doing it this way, you can put other engines on top of the data as needed, so you’re not going to face a lot of rework in the future should you decide to change something.

From OLTP to Data Lake Analytics

The high-level concept is an initial one-time bulk migration of the data in your OLTP databases to get a copy into S3. After that, as your operational databases continue to generate or change data, you’ll need to establish a pipeline, a stream, or a Change Data Capture (CDC) process to get those changes into S3. Note that you will rarely want data flowing back from S3 into your relational databases.

While there are different ways to pipe data into S3, one AWS-recommended approach is to use the AWS Database Migration Service, a.k.a. DMS (much in the same way you may have used it when you migrated off-prem). With AWS Database Migration Service you can perform the first one-time bulk load and then continuously replicate your data with high availability, streaming it to Amazon S3. AWS DMS runs in the background and handles all of the data changes for you. You can pick the instance and the period of time you’d like it to run – for example, hourly or daily partitions. That will depend on how fast your data is changing and on your requirements for analytics.

Next you’ll want to install Presto on top, and for that you can build a Presto cluster yourself, or simply use Ahana Cloud for Presto to create pre-configured clusters in about 30 minutes.

It’s also worth noting that after you’ve ingested the data into S3 in what you think is the most optimized format or folder structure, you may find out that you need it organized differently. In that case, not to worry: you can use Presto itself to do data lake transformations as well – using SQL you can do a CTAS, CREATE TABLE AS SELECT:

The CREATE TABLE AS SELECT (CTAS) statement is one of the most important SQL features available. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest and fastest way to create and insert data into a table with a single command.
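Here is a minimal sketch of such a transformation, with all names hypothetical: it rewrites raw ingested data as Parquet, partitions it by day, and drops incomplete rows along the way (with the Hive connector the partition column must come last in the SELECT):

CREATE TABLE hive.lake.orders_by_day
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['order_date']
) AS
SELECT order_id, customer_id, amount, order_date
FROM   hive.lake.orders_raw
WHERE  order_id IS NOT NULL;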

Summary

By adding a modern analytics stack to your operational databases, you evolve your analytics capabilities and deliver more insights for better decisions. We suggest moving to an open data lake analytics reference architecture with Presto. This allows analysis across all your data, giving a view of broader trends across databases and other data sources.

Ahana Cofounders Make Data Predictions for 2021

Open Analytics, Open Source, Multi-Cloud and Federated, Disaggregated Stack Rise to the Top 

San Mateo, Calif. – January 6, 2021 Ahana’s Cofounder and Chief Product Officer, Dipti Borkar, and Cofounder and Chief Technology Officer, Dave Simmen, predict major developments in cloud, data analytics, databases and data warehousing in 2021. 

As the shift to the cloud and multi-cloud environments has become even greater during the past year hastened by the challenges of the COVID-19 pandemic, new challenges have arisen when it comes to managing data and workloads. Companies want to keep their data secure in their own accounts and environments but still leverage the cloud for analytics. They want quick analytics on their data lakes. They also want to take advantage of containers for their applications across different cloud environments. 

Dipti Borkar, Co-founder and Chief Product Officer, outlines the major trends she sees on the horizon in 2021:

  • Open Source for Analytics & AI – Next year will see a rise in usage of analytic engines like Presto and Apache Spark for AI applications because of their open nature – open source license, open format, open interfaces, and open cloud. 
  • Open Source for Open Analytics – More technology companies will adopt an open source approach for analytics compared to the proprietary formats and technology lock-in that came with traditional data warehousing. This open analytics stack uses open source Presto as the core engine; open formats such as JSON, Apache ORC, Apache Parquet and others; open interfaces such as standard JDBC / ODBC drivers to connect to any reporting / dashboarding / notebook tool and ANSI SQL compatibility; and is open cloud. 
  • Containerizing Workloads for Multi-Cloud Environments – As the importance of a multi-cloud approach has gained traction over the past year, next year more companies will run container workloads in their multi-cloud environments. To do that, they’ll need their compute processing to be multi-cloud ready and containerized out of the box, so choosing a compute framework will become even more critical for these workloads. Engines like Presto, which are multi-cloud ready and container friendly, will become the core engine of these multi-cloud containerized workloads.
  • Less Complexity, More Kubernetes-ity for SaaS – Containers provide scalability, portability, extensibility and availability advantages, but managing them is not seamless and, in fact, is often a headache. Kubernetes takes that pain away for building, delivering and scaling containerized apps. 2021 will bring more managed SaaS apps running on K8s, and those that are able to abstract the complexities of their platforms from users will emerge as the winners.
  • The New “In-VPC” Deployment Model – As cloud adoption has become mainstream, companies are creating and storing the majority of their data in the cloud, especially in cost-efficient Amazon S3-based data lakes. To address data security concerns, these companies want to remain in their own Virtual Private Cloud (VPC). As a result, 2021 will bring in a new cloud-native architecture model for data-focused managed services, which I’m calling the “In-VPC” deployment model. It separates the control plane from the compute and data planes for better security and cleaner management.

Dave Simmen, Co-founder and Chief Technology Officer, outlines the major trends he sees on the horizon in 2021:

  • The Next Evolution of Analytics Brings a Federated, Disaggregated Stack – A federated, disaggregated stack that addresses the new realities of data is displacing the traditional data warehouse with its tightly coupled database. The next evolution of analytics foresees that a single database can no longer be the solution to support a wide range of analytics as data will be stored in both data lakes and a range of other databases. SQL analytics will be needed for querying both the data lake and other databases. We’ll see this new disaggregated stack become the dominant standard for analytics with SQL-based technologies like the Presto SQL query engine at the core, surrounded by notebooks like Jupyter and Zeppelin and BI tools like Tableau, PowerBI, and Looker.
  • SQL Is the New…SQL – As companies shift their data infrastructure to a federated (one engine queries different sources), disaggregated (compute is separate from storage is separate from the data lake) stack, we’ll see traditional data warehousing and tightly coupled database architectures relegated to legacy workloads. But one thing will remain the same when it comes to this shift – SQL will continue to be the lingua franca for analytics. Data analysts, data engineers, data scientists and product managers along with their database admins will use SQL for analytics.

Tweet this: @AhanaIO announces 2021 #Data Predictions #cloud #opensource #analytics https://bit.ly/39j6HHl

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

What is DB Presto Online?

If you are looking for online resources for Presto like the docs, then start here https://prestodb.io/ 

If you are looking for information on how to tell if your Presto cluster is online or offline then read on.

Presto Mac/Brew Install

If you installed Presto on your Mac using brew then the service is called presto-server and you manage it like this:

$ presto-server run

To tell if your presto server is online:

$ presto-server status
Not Running

Presto manual Install 

If you are running a regular deployment of open source Presto you start/stop/check status with:

$ bin/launcher stop
$ bin/launcher start
$ bin/launcher status
Running as 61824

To be sure your Presto server is online, point your browser at its host and port (e.g. localhost:8080) and you should see the Presto Console’s “Cluster Overview” UI.

Advanced SQL Queries with Presto

Advanced SQL features and functions are used by analysts when, for example, complex calculations are needed; when many tables (perhaps from multiple sources) need to be joined; when dealing with nested or repeated data, time-series data, or complex data types like maps, arrays, structs and JSON; or perhaps a combination of all these things.  

Presto’s ANSI SQL engine supports numerous advanced functions which can be split into the following categories – links to the PrestoDB documentation are provided for convenience:

Running advanced SQL queries can benefit greatly from Presto’s distributed, in-memory processing architecture and cost-based optimizer. 
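As a small sketch of what these features look like in practice – the events table, its tags array column and the other names are hypothetical – the query below flattens an array with UNNEST and applies a window function to the result:

-- Flatten an array column, then count how often each tag occurs
SELECT user_id,
       tag,
       COUNT(*) OVER (PARTITION BY tag) AS tag_frequency
FROM   events
CROSS JOIN UNNEST(tags) AS t(tag)
WHERE  event_date >= DATE '2021-01-01';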

Presto Platform Overview

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Some of the world’s most innovative and data-driven companies like Facebook, Walmart and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte per day.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto allows querying data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. Presto is an in-memory distributed, parallel system. 

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow “free” solution that requires excessive hardware. A single Presto query can combine data from multiple sources. 

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: Connectors link Presto to a data source such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database. Each connector is configured with a catalog properties file – a sample is sketched after this list. 
  • A parser, planner, scheduler, cost-based query optimizer, and execution engine.
  • Drivers for connecting tools, including JDBC. The Presto-cli tool. The Presto Console. 
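As noted in the connector bullet above, each connector is wired up through a catalog properties file in Presto’s etc/catalog directory. A minimal sketch for the MySQL connector might look like this (the host, user and password are placeholders):

# etc/catalog/mysql.properties – example values only
connector.name=mysql
connection-url=jdbc:mysql://example-host:3306
connection-user=presto_user
connection-password=example_password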

In terms of organization, the community-owned and community-driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. Presto software is released under the Apache License 2.0.

Ahana offers a managed service for Presto in the cloud. You can get started for free today.


How To Stop Presto

If you are using the presto-admin tool this is how to stop Presto safely:

$ presto-admin server stop

In addition these commands are also useful:

presto-admin server start
presto-admin server restart
presto-admin server status

If you are using Presto on EMR, you can restart Presto with:

sudo stop presto
sudo start presto

(If you have changed any configuration params you should do this on every node where you made a change).

The above “How To Stop Presto” information is correct for PrestoDB. But other “hard forks” of Presto may use different methods to stop and start. 

Presto New Releases

Where is the latest release of PrestoDB? And where can I find the release notes? Where is the documentation? These are common questions with easy answers. Presto’s web site https://prestodb.io/ and GitHub have all the information you need on Presto new releases:

Releases: https://github.com/prestodb for the Presto and Presto-admin repositories.

Release Notes: https://prestodb.io/docs/current/release.html 

Documentation: https://prestodb.io/docs/current/ 

The above list has all the main resources you need for working with Presto’s new releases. If you’re looking to get up and running quickly with Presto, Ahana Cloud is a SaaS platform for Presto. Presto can be complicated, with over 200 parameters to configure and tune – Ahana Cloud takes care of all of those; all you do is point and click to get your Presto cluster up and querying your data sources. You can sign up for a free trial and check it out at https://ahana.io/sign-up

How Much Memory Should I Give A Presto Worker Node?

Presto is an in-memory query engine and so naturally memory configuration and management is important. A common question that comes up is how much memory should I give a worker node?

JVM Memory

Presto’s JVM memory config nearly always needs to be adjusted – you shouldn’t be running Presto with the sample setting of 16GB of memory per worker/coordinator. That value is simply the JVM’s maximum heap size from the sample configuration, not a tuned recommendation.

The Xmx flag specifies the maximum memory allocation pool for a Java virtual machine. Change the  -Xmx16G in the jvm.config file to a number based on your cluster’s capacity, and number of nodes.  See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html on how to do this.  

Rule of Thumb

It is recommended you set aside 15-20% of total physical memory for the OS.  So for example if you are using EC2 “r5.xlarge” instances which have 32GB of memory,  32GB-20% = 25.6GB so you would use  -Xmx25G in the jvm.config file for coordinator and worker (or  -Xmx27G if you want to go with 15% for the OS).
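For example, the jvm.config for such a 32GB node might look like the sketch below; only the -Xmx line reflects the rule of thumb above, and the other flags are common JVM options shown purely for context:

-server
-Xmx25G
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError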

This is assuming there are no other services running on the server/instance, so maximum memory can be given to Presto. 

Presto Memory 

As with the JVM settings above, there are two memory-related settings you should check before starting Presto. For most workloads Presto’s other memory settings will work perfectly well when left at their defaults, although there are configurable parameters that control memory allocation which can be useful for specific workloads. The practical guidelines below will 1) help you decide whether you need to change your Presto memory configuration, and 2) indicate which parameters to change.

Workload Considerations

You may want to change Presto’s memory configuration to optimise ETL workloads versus analytical workloads, or for high query concurrency versus single-query scenarios.  There’s a great in-depth blog on Presto’s memory management written by one of Presto’s contributors at  https://prestodb.io/blog/2019/08/19/memory-tracking which will guide you in making more detailed tweaks. 

Configuration Files & Parameters 

When first deploying Presto there are two memory settings that need checking. Locate the config.properties files for both the coordinator and worker. There are two important parameters here: query.max-memory-per-node and query.max-memory.  Again see https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for rules-of-thumb and how to configure these parameters based on the available memory. 
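As a purely illustrative sketch (the right values depend on your node sizes and the guidance linked above), the relevant lines in config.properties look like this:

# etc/config.properties – example values only
query.max-memory-per-node=10GB
query.max-memory=50GB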

The above guidelines and links should help you decide how much memory to give a worker node. You can see there is a lot of tuning and configuration to manage – over 200 parameters. To avoid all memory configuration work you can use Ahana Cloud for Presto – a fully managed service for Presto that needs zero configuration. Sign up for a free trial at https://ahana.io/sign-up.

Ahana Announces General Availability of Managed Service for Presto on AWS; Delivers Combined Solution with Intel to Drive Adoption of Open Data Lakes Analytics

San Mateo, Calif. – December 9, 2020 Ahana, the self-service analytics company for Presto, announced today the General Availability of Ahana Cloud for Presto, the first cloud-native managed service focused on Presto on Amazon Web Services (AWS). Additionally, Ahana announced a go-to-market solution in collaboration with Intel via its participation in the Intel Disruptor Program to offer an Open Data Lake Analytics Accelerator Package for Ahana Cloud users that leverages Intel Optane on the cloud with AWS. 

Ahana Cloud for Presto is the only easy-to-use, cloud-native managed service for Presto and is deployed within the user’s AWS account, giving customers complete control and visibility of clusters and their data. In addition to the platform’s use of Amazon Elastic Kubernetes Services (Amazon EKS) and a Pay-As-You-Go (PAYGO) pricing model in AWS Marketplace, the new release includes enhanced manageability, security and integrations via AWS Marketplace.

 Ahana Cloud for Presto includes: 

  • Easy-to-use Ahana SaaS Console for creation, deployment and management of multiple Presto clusters within a user’s AWS account bringing the compute to user’s data 
  • Support for Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL and Amazon Elasticsearch 
  • Click-button integration for user-managed Hive Metastores and AWS Glue
  • Built-in hosted Hive Metastore that manages metadata for data stored in Amazon S3 data lakes 
  • Pre-integrated and directly queryable Presto query log and integrations with Amazon CloudWatch 
  • Cloud-native, highly scalable and available containerized environment deployed on Amazon EKS

“With Ahana Cloud being generally available, the power of Presto is now accessible to any data team of any size and skill level. By abstracting away the complexities of deployment, configuration and management, platform teams can now deploy ‘self-service’ Presto for open data lake analytics as well as analytics on a range of other data sources,” said Dipti Borkar, Cofounder and Chief Product Officer, Ahana. “Users are looking for analytics without being locked-in to proprietary data warehouses. This offering brings a SaaS open source analytics option to users with Presto at its core, using open formats and open interfaces.”

“As Ahana Cloud users, we saw from day one the value the platform brings to our engineering team,” said Kian Sheik, Data Engineer, ReferralExchange. “Within about an hour we were up and running Presto queries on our data, without having to worry about Presto under the covers. With out-of-the-box integrations with a Hive data catalog and no configurations needed, Ahana Cloud takes care of the intricacies of the system, allowing our team to focus on deriving actionable insights on our data.”

Ahana also announced its participation in the Intel Disruptor Program to drive the adoption of Open Data Lake Analytics. Together, Ahana and Intel will offer an Open Data Lake Analytics Accelerator Package, available for Ahana Cloud users that leverage Intel Optane on AWS. It includes special incentives and PAYGO pricing. An Open Data Lake Analytics approach is a technology stack that includes open source, open formats, open interfaces, and open cloud, a preferred approach for companies that want to avoid proprietary formats and technology lock-in that come with traditional data warehouses. The offering is aimed at improving joint customers’ experience of running Presto in the cloud to help power the next generation of analytics use cases.

“We look forward to working with Ahana and helping bring this compelling open data lake analytic solution to market,” said Arijit Bandyopadhyay, CTO of enterprise analytics & AI within Intel’s data platform group. “As more companies require data to be queried across many different data sources like Amazon S3, Amazon Redshift and Amazon RDS, Presto will become even more mission critical. Intel Optane coupled with the Ahana Cloud platform provides superior analytical performance and ease of use for Presto, enabling data-driven companies to query data in place for open data lake analytics.”

Availability and Pricing
Ahana Cloud for Presto is available in AWS Marketplace, with support for Microsoft Azure and Google Cloud to be added in the future. Ahana Cloud for Presto is elastically priced based on usage, with PAYGO and annual options via AWS Marketplace, starting from $0.25 per Ahana cloud credit. 

The Ahana/Intel Open Data Lakes Accelerator Package is available today via an AWS Marketplace Private Offer.

Supporting Resources

Tweet this: .@AhanaIO announces GA of #ManagedService for #Presto; Joins Intel Disruptor Program #cloudnative #opensource #analytics @prestodb https://bit.ly/2LiE6cL

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto for Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Ahana is a premier member of Linux Foundation’s Presto Foundation and actively contributes to the open source PrestoDB project. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

How do I show tables from all schemas with Presto?

In Presto it is straightforward to show all tables in a schema. For example, if we have a MySQL data source/catalog that has a “demo” schema we can use show tables in mysql.demo; but this only lists the tables in that one schema.

There is no equivalent way to show all tables in all schemas for a data source. However, there is the metastore to fall back on, which we can query: in MySQL, Glue, Hive and others there is a schema called “information_schema” which contains a table called “tables”. It maintains a list of the schemas and tables belonging to that data source.

For a MySQL data source, here’s what Presto returns:

presto> select table_schema, table_name from mysql.information_schema.tables order by 1,2;
    table_schema    |                  table_name                   
--------------------+-----------------------------------------------
 demo               | geography                                     
 demo               | state                                         
 demo               | test                                          
 information_schema | applicable_roles                              
 information_schema | columns                                       
 information_schema | enabled_roles                                 
 information_schema | roles                                         
 information_schema | schemata                                      
 information_schema | table_privileges                              
 information_schema | tables                                        
 information_schema | views                                         
 sys                | host_summary                                  
 sys                | host_summary_by_file_io                       
 sys                | host_summary_by_file_io_type                  
...

For Ahana’s integrated Hive metastore:

presto:demo> select table_schema, table_name from ahana_hive.information_schema.tables order by 1,2;
    table_schema    |        table_name         
--------------------+---------------------------
 csv_test           | yellow_taxi_trips         
 csv_test           | yellow_taxi_trips_orc     
 csv_test           | yellow_taxi_trips_staging 
 information_schema | applicable_roles          
 information_schema | columns
...    

This should help you show tables from all schemas.

How do I convert Bigint to Timestamp with Presto?

UNIX timestamps are normally stored as doubles. If you have UNIX timestamps stored as big integers then you may encounter errors when trying to cast them as timestamps:

presto> select col1 from table_a;
     col1      
------------
 1606485526 
 1606485575 
 
presto> select cast(col1 as timestamp) from table_a;
Query 20201127_150824_00052_xnxra failed: line 1:8: Cannot cast bigint to timestamp

There is a solution!  Presto’s from_unixtime() function takes a UNIX timestamp and returns a timestamp:

presto> select col1,from_unixtime(col1) as ts from table_a;
    col1    |          ts          
------------+-------------------------
 1606485526 | 2020-11-27 13:58:46.000 
 1606485575 | 2020-11-27 13:59:35.000 

And we can optionally modify the format of the result by using date_format():

presto> select date_format(from_unixtime(col1),'%Y-%m-%d %h:%i%p') from table_a;
        _col0        
---------------------
 2020-11-27 01:58PM 
 2020-11-27 01:59PM 

That’s how to use from_unixtime() to convert a bigint to timestamp. 

How do I convert timestamp to date with Presto?

Luckily Presto has a wide range of conversion functions and they are listed in the docs.  Many of these allow us to specifically convert a timestamp type to a date type.

To test this out we can use Presto’s built-in current_timestamp function (an alias for the now() function) that returns the current system time as a timestamp:

presto> select current_timestamp as "Date & Time Here Now";
         Date & Time Here Now          
---------------------------------------
 2020-11-27 13:20:04.093 Europe/London 
(1 row)

To grab the date part of a timestamp we can simply cast to a date:

presto> select cast(current_timestamp as date) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

Or we can use date() which is an alias for cast(x as date):

presto> select date(current_timestamp) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

We can use date_format() which is one of Presto’s MySQL-compatible functions: 

presto:demo> select date_format(current_timestamp, '%Y%m%d') as "Today's date";
 Today's date  
----------
 20201127 
(1 row)

Finally we can use format_datetime() which uses a format string compatible with JodaTime’s DateTimeFormat pattern format:

presto:demo> select format_datetime(current_timestamp, 'Y-M-d') as "Date";
  Date  
----------
 2020-11-27 
(1 row)

The above 5 examples should allow you to convert timestamps to dates in any scenario.

Common Questions/Errors:

Why am I getting the following error: ‘>’ cannot be applied to date, varchar (10)

Answer: You are trying to compare a string literal in your query to a date type in your schema. The operator in the error can be any operator >,=,<,>=,<=. To fix that, you need to cast the string to a date type or use a function to parse the string to date.

Example error: select now() as time_now where current_date = '2021-06-01'

Example solution:

Option 1: Date literal
select now() as time_now where current_date > date '2021-06-01'

Option 2: Explicit cast
select now() as time_now where current_date > Cast ( '2021-06-01' as Date)

Option 3: Parse iso8601 date
select now() as time_now where current_date > from_iso8601_date('2021-06-01')

Option 4: Non-standard date format using date_parse (MySQL-style format string)
select now() as time_now where current_date > date_parse('01-06-2021','%d-%m-%Y')

Option 5: Non-standard date format using parse_datetime (Joda-Time format string)
select now() as time_now where current_date > parse_datetime('01-06-2021','dd-MM-YYYY')

How do I convert string with timestamp to a date?

Answer: You are trying to accurately convert a string such as ‘2021-05-13T19:02:57.415006+01:00’ to a date and use it in your query for comparison. You need to parse the string using either date_parse or parse_datetime.

Example problem: Need to convert the following timestamp to date: ‘2021-05-13T19:02:57.415006+01:00’

Example solution:

Option 1: Non-standard date format using date_parse (MySQL-style format string; note the 24-hour %H specifier)
select date_parse('2021-05-13T19:02:57.415006','%Y-%m-%dT%H:%i:%s.%f') AT TIME ZONE '+01:00'

Option 2: Non-standard date format using parse_datetime (Joda-Time format string)
select parse_datetime('2021-05-13T19:02:57.415006+01:00','YYYY-MM-dd''T''HH:mm:ss.SSSSSSZ') 

How do I configure Case Sensitive Search with Presto?

When dealing with character data, case sensitivity can be important when  searching for specific matches or patterns. But not all databases and query engines behave in the same way. Some are case insensitive by default, some are not. How do we configure things so they behave in the way we want?

Here’s an example of why we might need to take steps to control case sensitivity. We’re accessing a MySQL database directly:

mysql> select * from state where name='iowa';
+------+----+--------------+
| name | id | abbreviation |
+------+----+--------------+
| Iowa | 19 | IA           |
+------+----+--------------+
1 row in set (0.00 sec)

MySQL is case-insensitive by default. Even though the MySQL column contains the capitalized string ‘Iowa’ it still matched the query’s restriction of ‘iowa’.  This may be acceptable, but in some use cases it could lead to unexpected results.

Using Presto to access the same MySQL data source things behave differently, and arguably, in a more expected way:

presto:demo> select * from state where name='iowa';
 name | id | abbreviation 
------+----+--------------
(0 rows)
 
Query 20201120_151345_00001_wjx6r, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:07 [1 rows, 0B] [0 rows/s, 0B/s]
 
presto:demo> select * from state where name='Iowa';
 name | id | abbreviation 
------+----+--------------
 Iowa | 19 | IA           
(1 row)

Now we only get a match with ‘Iowa’, and not with ‘iowa’. Presto has essentially made this data source (MySQL) case sensitive, even though it is exactly the same database in both the above examples, with default configurations used.

Reconfigure Case Sensitivity 

With an RDBMS like MySQL you can configure the collation setting to control case sensitivity. You can set the collation at database or table creation time as part of the CREATE statement, or use ALTER to change the collation of a database, table, or individual column. This is described in MySQL’s documentation.

But how do you change Presto to be case-insensitive? Presto does not support collation, because it is not a database and doesn’t store data. And there is no configuration parameter that controls this.  To manage case sensitivity in Presto, and mimic collation, we rewrite the query to force case insensitivity explicitly by using:

  • simple lower() or upper() functions

Example:

select * from state where lower(name)='california';
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

This query has matched any upper/lower case combination in the table, mimicking case insensitivity.

Or we can use regular expressions.

Example:

select * from state where regexp_like(name, '(?i)california');
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

The regular expression syntax (?i) means matches are case insensitive. 

When it comes to Case Sensitive Search Configuration you are now an eXpErT.

When should I use ORC versus Parquet when using Presto?

If you’re working with open data lakes using open source and open formats, you may have data in multiple formats. Presto works with both ORC and Parquet, so you’ll want to optimize for your workloads.

Both ORC and Parquet store data in columns. With Presto, Parquet is generally the most efficient in terms of storage and query performance, while ORC is ideal for storing compact data and for skipping over irrelevant data without complex or manually maintained indexes. For example, ORC is typically better suited for the smaller dimension tables, while Parquet works better for the much bigger fact tables.
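One practical way to choose is to materialize the same data in both formats with CREATE TABLE AS and benchmark your typical queries against each copy. Here is a minimal sketch, assuming a hypothetical Hive catalog named ahana_hive and an existing source table my_schema.events (both placeholders):

-- Write one copy as Parquet and one as ORC, then compare query times
CREATE TABLE ahana_hive.my_schema.events_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM ahana_hive.my_schema.events;

CREATE TABLE ahana_hive.my_schema.events_orc
WITH (format = 'ORC')
AS SELECT * FROM ahana_hive.my_schema.events;

Run a few representative queries against events_parquet and events_orc to see which format suits your workload.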

If you’re looking to get up and running quickly with Presto, you can check out Ahana Cloud. It’s a SaaS for Presto and takes care of all the configuration, tuning, deployment, etc.

What’s the advantage of having your own Hive metastore with Presto? How does it compare to AWS Glue?

First let’s define Apache Hive versus AWS Glue. Apache Hive reads, writes, and manages large datasets using SQL, and was built for Hadoop. AWS Glue is a fully managed ETL service for preparing and loading data for analytics. It automates ETL and handles the schemas and transformations. AWS Glue is serverless, so there’s no infrastructure to provision or manage (you only pay for the resources used while your jobs are running).

Presto isn’t a database and does not come with a catalog, so you’d typically use the Hive Metastore to read, write, and manage your datasets. Presto sits on top of a catalog like Hive. You can also use the AWS Glue Data Catalog as the metastore for Presto.
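If you run Presto yourself, pointing the Hive connector at the Glue Data Catalog is typically just a catalog properties change. Here is a minimal sketch of an etc/catalog/glue.properties file; the region value is a placeholder, and exact property names can vary by Presto version, so check the Hive connector docs for your release:

connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=us-east-1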

With Ahana Cloud, you don’t really need to worry about integrating Hive and/or AWS Glue with Presto. Presto clusters created with Ahana come with a managed Hive metastore and a pre-integrated Amazon S3 data lake bucket. Ahana takes care of connecting external catalogs like the Hive Metastore and AWS Glue, so you can focus more on analytics and less on integrating your catalogs manually. You can also create managed tables as opposed to external tables.

How do you find out the data type of a value with Presto?

Presto has a typeof() function that makes it easy to find the data type of a value. This is particularly useful when you are getting values from nested maps, for example, and the data types need to be determined.

Here’s a simple example showing the type returned by Presto’s now() function:

presto:default> select now() as System_Timestamp, typeof( now() ) as "the datatype is";
           System_Timestamp            |     the datatype is      
---------------------------------------+--------------------------
 2020-11-18 15:15:09.872 Europe/London | timestamp with time zone 

Some more examples:

presto:default> select typeof( 'abc' ) as "the datatype is";
 the datatype is 
-----------------
 varchar(3)      
 
presto:default> select typeof( 42 ) as "the datatype is";
 the datatype is 
-----------------
 integer         
 
presto:default> select typeof( 9999999999 ) as "the datatype is";
 the datatype is 
-----------------
 bigint          
 
presto:default> select typeof( 3.14159 ) as "the datatype is";
 the datatype is 
-----------------
 decimal(6,5)

Armed with this info you should now be able to find out the data types of values.

How do you rotate rows to columns with Presto?

Sometimes called pivoting, here is one example of how to rotate row data with Presto.  

Suppose we have rows of data like this:

'a', 9
'b', 8
'a', 7 

We want to pivot this data so that all the ‘a’ row values are arranged in one column, and all the ‘b’ row values are in a second column like this:

 a | b 
---+---
 9 |   
   | 8 
 7 |   

To rotate from rows to columns we will add an id to make aggregation easy. We will name the output columns a and b after the key values, and we’ll include the id in our result set. This is how we do the rotation in Presto, using VALUES() to supply the test data and simple conditional CASE WHEN END logic:

presto:default> SELECT id
, MAX(CASE WHEN key = 'a' THEN value END) AS a
, MAX(CASE WHEN key = 'b' THEN value END) AS b 
FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7 )) as test_data (id, key, value) 
GROUP BY id ORDER BY id;
 
 id |  a   |  b   
----+------+------
  1 |    9 | NULL 
  2 | NULL |    8 
  3 |    7 | NULL 
(3 rows)

There are other SQL options for transforming (pivoting) rows into columns – you can use the map_agg function for example.
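As an illustration, here is a minimal sketch of the same pivot done with map_agg(): each id’s key/value pairs are collected into a map, and element_at() then pulls the a and b entries back out (the test data mirrors the example above):

-- Pivot with map_agg(): build a map per id, then look up each key
SELECT id,
       element_at(kv, 'a') AS a,   -- NULL when this id has no 'a' entry
       element_at(kv, 'b') AS b
FROM (
    SELECT id, map_agg(key, value) AS kv
    FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7)) AS test_data (id, key, value)
    GROUP BY id
)
ORDER BY id;

This returns the same three rows as the CASE WHEN version above.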

The code sample and description here should help when you need to rotate data from rows to columns using Presto.

If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

How do you rotate columns to rows with Presto?

Sometimes called unpivoting, here is one example of how to rotate column data with Presto.  

Suppose we have some integer data arranged in two columns called a and b:

 a | b 
---+---
 9 |   
   | 8 
 7 |   

We want to rotate the columns into rows like this, where for every ‘a’ column value we now see a row labeled ‘a’, and ditto for the b values:

'a', 9
'b', 8
'a', 7

To rotate from columns to rows in Presto we’ll use a CTE and VALUES() to supply the test data, and simple conditional CASE WHEN END logic coupled with a sub-select and a UNION:

presto:default> with testdata(value_a, value_b) as ( VALUES (9,null), (null,8), (7,null) ) 
select key, value from 
(select 'a' as key, value_a as value 
from testdata 
UNION select 'b' as key, value_b as value 
from testdata) 
where value is not null;
 
 key | value 
-----+-------
 a   |     9 
 b   |     8 
 a   |     7 
(3 rows)

There are other SQL options for rotating (unpivoting) columns into rows: for example, CROSS JOIN UNNEST, which is similar to Hive’s LATERAL VIEW explode function, as shown below.
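Here is a minimal sketch of the same unpivot using CROSS JOIN UNNEST with two arrays, one holding the column names and one holding the values; UNNEST zips them into rows:

-- Unpivot with CROSS JOIN UNNEST: pair each column name with its value
WITH testdata(value_a, value_b) AS ( VALUES (9, NULL), (NULL, 8), (7, NULL) )
SELECT t.key, t.value
FROM testdata
CROSS JOIN UNNEST(ARRAY['a', 'b'], ARRAY[value_a, value_b]) AS t (key, value)
WHERE t.value IS NOT NULL;   -- drop the empty cells, as the UNION version does

This produces the same key/value rows as the UNION query above.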

The code sample and description here should help when you need to rotate data from columns to rows using Presto.

What are the operational benefits of using a managed service for Presto with Ahana Cloud?

First let’s hear from an AWS Solution Architect: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service, the customer of Ahana does not have to do a lot of work, everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service. The customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC. So you, as the customer, have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have that context, let’s get more specific. Say you want to create a cluster: it’s just a couple of clicks with Ahana Cloud. You can pick the coordinator instance type and the Hive metastore instance type, and it is all flexible. Instead of using the Ahana-provided Hive metastore, you can bring your own AWS Glue catalog. It’s also easy to add data sources: you add the JDBC endpoints for your databases, Ahana integrates them, and Ahana Cloud automatically restarts the cluster.

Compared to EMR or if you’re running with other distributions, all of this has to be done manually:

  • you have to create a catalog properties file for each data source
  • restart the cluster on your own
  • scale the cluster manually
  • set up your own query logs and statistics
  • rebuild everything when you stop and restart clusters

With Ahana, all of this manual complexity is taken away. For scaling up, if you want to grow the analytics jobs over time, you can add nodes seamlessly. Ahana Cloud and other distributions can add the nodes to the cluster while your services are still up and running. But the part that isn’t seamless is when you stop the entire cluster. In addition to all the workers and the coordinator being provisioned, the configuration and the cluster connections to the data sources, and the Hive metastore are all maintained with Ahana Cloud. And so when you restart the cluster back up, all of that comes up pre-integrated with the click of a button: the nodes get provisioned again, and you have access to that same cluster to continue your analytics service. This is very important, because otherwise, you would have to manage it on your own, including the configuration management and reconfiguration of the catalog services. Specifically for EMR, for example, when you terminate a cluster, you lose track of that cluster altogether. You have to start from scratch and reintegrate the whole system.

How do you do a Lateral View Explode in Presto?

Hive’s explode() function takes an array (or a map) as input and outputs the elements of the array (map) as separate rows. explode() is a built-in table-generating function (UDTF) in Hive and can be used in a SELECT expression list and as part of a LATERAL VIEW.

The explode function doesn’t exist in Presto; instead we can use Presto’s similar UNNEST. 

Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

This code sample and description should help when you need to do a lateral view explode in Presto.

If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

How do you cross join unnest a JSON array?

Let’s take a look at how you can cross join unnest a JSON array by doing Presto joins. Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

The UNNEST approach is similar to Hive’s explode function.

This code sample and description should help when you need to execute a cross join to unnest a JSON array. 

How can you write the output of queries to S3 easily?

With Ahana Cloud, we’ve made it easy for you to write the output of queries to S3. A variety of formats are supported; here’s an example that writes Parquet:

presto> CREATE SCHEMA ahana_hive.s3_write WITH (location = 's3a://parquet-test-bucket/');
CREATE SCHEMA

presto> CREATE TABLE ahana_hive.s3_write.my_table
WITH (format = 'PARQUET')
AS SELECT <your query here> ;

Does Amazon Athena do joins across other data sources besides S3? Does Amazon Athena connect to other data sources?

With Amazon Athena you’re limited in scope when it comes to doing joins across other data sources like relational database systems and more. You have to set up a Lambda function that connects to your database on the back end, which is an additional piece you have to manage on your own.


With Ahana Cloud, you can directly connect to many different types of data sources including S3, MySQL, PostgreSQL, Redshift, Elastic, and more with a few clicks, and you’re ready to integrate with any cluster.

If I have catalogs connected and configurations attached to my Presto cluster, what happens when I take the cluster down?

If you’re managing Presto on your own, either through your own installation or through a service like AWS EMR or AWS Athena, you have to maintain and manage all of the catalogs and configurations attached to your cluster. That means that if you take your cluster down, you’ll lose those catalogs and configurations – they are not maintained for your cluster.


You can use the Ahana Cloud managed service for Presto to help with this. Ahana Cloud manages all of that for you, so you don’t have to worry about losing catalogs and configurations attached to your Presto cluster. You also get Presto bundled with data sources like the Hive metastore, Apache Superset, and more.

Optimize Presto EMR

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) simplifies running big data and analytics frameworks like Presto for scalable compute in the cloud. It provides on-demand, scalable Hadoop clusters for processing large data sets. You can move large volumes of data into and out of AWS datastores like S3 with Amazon EMR. AWS EMR uses Amazon EC2 instances for fast provisioning, scalability and high availability of compute power. 

With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premises Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.

As a Hadoop distribution, AWS EMR incorporates various Hadoop tools, including Presto, Spark and Hive, so that users can query and analyze their data. With AWS EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on each cluster instance for the lifetime of the cluster. In order to persist data stored in HDFS, it must be manually copied to S3 before the cluster is terminated.

What is Presto?

Presto is an open source, federated SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. It runs on a cluster of machines and enables interactive, ad hoc analytics on large amounts of data. 

Presto enables querying data where it lives, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, and even proprietary data stores. It’s not really a question of Presto versus Hive on EMR; it’s about how to use them together. Presto allows users to access data from multiple sources, allowing for analytics across an entire organization. A typical combination might be Hadoop with Presto for analytics.

Using Presto on Amazon EMR

With Presto on AWS EMR, you can run interactive queries on large data sets with minimal setup time: AWS EMR handles the provisioning, configuration, and tuning of the Hadoop cluster, and you can tune Presto queries on EMR as well. Provided you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included automatically as part of the cluster software. Earlier versions of AWS EMR include Presto as a sandbox application.

AWS EMR And Presto Configurations

As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In AWS EMR, data is either copied to HDFS on each cluster instance or read directly from S3. From EMR 5.12.0 onwards, Presto uses EMRFS by default to connect to Amazon S3. EMRFS extends the HDFS API to S3, giving Hadoop applications like Presto access to data stored in S3 without additional configuration or copying of the data. On earlier versions of AWS EMR, data in S3 can be accessed using Presto’s Hive connector.
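As an illustration, outside of EMR a Presto Hive catalog that reads from S3 usually needs little more than a metastore endpoint and S3 credentials (or an instance role). Here is a minimal sketch of an etc/catalog/hive.properties file with placeholder host and keys; property names can vary slightly between Presto versions:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore-host:9083
# S3 access keys; alternatively rely on the instance's IAM role
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY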

Real world applications

Jampp is a mobile app marketing platform that uses programmatic ads to acquire new users and retarget those users with relevant ads. It sits between advertisers and their audiences, so real-time bidding of media advertising space is critical for their business. The amount of data Jampp generates as part of the bidding cycle is massive: 1.7B events are tracked per day, 550K requests per second are received, and 100TB of data is processed by AWS elastic load balancers per day. PrestoDB plays a critical role in their data infrastructure. Jampp relies on Presto on AWS EMR for their ad hoc queries and performs over 3K ad hoc queries per day on over 600TB of queryable data.

Ahana Welcomes Database Pioneer David E. Simmen to Executive Team as Chief Technology Officer

Former Splunk Fellow and Chief Architect & Teradata Fellow brings 30+ years of database expertise to lead technology innovation

San Mateo, CA, July 23, 2020 — Ahana, the SQL analytics company for Presto, today announced the appointment of David E. Simmen as Chief Technology Officer. Simmen will oversee the company’s technology strategy while driving product innovation. His extensive database experience will further accelerate Ahana’s vision of simplifying ad hoc analytics for organizations of all shapes and sizes.

“Dave has a remarkable record of innovation in distributed databases, data federation and advanced analytics. That, coupled with his over 30 years of architecting and building in the field, is nothing short of impressive,” said Steven Mih, Co-founder & CEO of Ahana. “His experience innovating distributed database technology with Splunk, Teradata and IBM provides Ahana the depth needed for truly groundbreaking development to achieve our vision of unified, federated analytics. I couldn’t be more thrilled to have Dave at the helm as CTO and Co-founder.”

Simmen joined Ahana most recently from Apple where he engineered iCloud database services. Prior to Apple, he was Chief Architect with Splunk and named the first Fellow in the company’s history. There, he built Splunk’s multi-year technical roadmap and architecture in support of enterprise customers. Prior to Splunk, Simmen was Engineering Fellow and CTO of Teradata Aster where he set the technical direction and led the architecture team that built advanced analytics engines and storage capabilities in support of big data discovery applications.

Earlier in his career, Simmen was a Senior Technical Staff Member (STSM) at IBM Research adding innovations to DB2 like the IBM Starburst SQL compiler, support for MPP and SMP, database federation, and many more. Simmen is also a named inventor on 37 U.S. patents and has 15 publications to his name including Fundamental Techniques for Order Optimization, Robust Query Processing through Progressive Optimization, and Accelerating Big Data Analytics With Collaborative Planning.

“Given the spread of data across data lakes, data warehouses, and the plethora of relational and non-relational data services being offered today, the need for unified, federated analytics is only increasing. Presto is one of the most popular federated query engines but it’s missing fundamental optimizations developed over the decades by academic researchers and industry pioneers,” said Simmen. “That needs to be applied to Presto, making it even more powerful to eventually become the heart of the unified, federated analytical stack. I look forward to working with the Ahana team to turn this vision into reality.”

“In my time with Dave at IBM’s Almaden Research Center and then with the DB2 optimizer team at Silicon Valley Laboratory, I saw firsthand his technical expertise and his broader leadership skills,” said Dr. Laura Haas, Dean of the College of Information and Computer Sciences at UMass Amherst and retired IBM Fellow. Dean Haas is best known for her work on the Starburst query processor and Garlic, a federated engine. 

Dean Haas continued, “His combined leadership and technical depth is unparalleled, and it’s wonderful that he has joined Ahana to drive innovation for the next generation of data federation technology for analytics. I look forward to seeing his vision come to life in this space.”

Resources:

Download a headshot of David E. Simmen.

Links to David Simmen’s published research papers 

Tweet this:  .@AhanaIO welcomes Database Pioneer David E. Simmen to Executive Team as #CTO #database #opensource #analytics #cloud bit.ly/30xmwWr

About Ahana

Ahana, the SQL analytics company for Presto, is focused on evangelizing the Presto community and bringing simplified ad hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to enable unified, federated analytics. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn and Twitter.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

beth@ahana.io

Ahana Announces Linux Foundation’s PrestoDB Now Available on AWS Marketplace and DockerHub

Ahana to provide community and commercial support for PrestoDB users 

San Mateo, CA, June 30, 2020 — Ahana, the Presto-based analytics company, announced today the availability of the completely open source PrestoDB Amazon Machine Image (AMI) on the AWS Marketplace as well as a PrestoDB container on DockerHub. The PrestoDB AMI is the first and only completely open source and completely free Presto offering available on the AWS Marketplace. These free PrestoDB offerings make it significantly easier for data platform teams to get started with Presto in the cloud, particularly for interactive, ad hoc analytics on S3 data lakes and other popular data sources like AWS RDS, Redshift, Amazon’s Elasticsearch service and others. 

Additionally, Ahana will provide commercial support for users who require technical help and management of their container or AMI-based PrestoDB clusters.

PrestoDB is a federated SQL engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. As a result, data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place, without the need to move data. One of the fastest growing projects in the data analytics space, PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter. 

“We’re looking forward to making it easier for the Presto community to get started with Presto,” said Dipti Borkar, co-founder and Chief Product Officer at Ahana and Outreach Committee Chairwoman of the Presto Foundation. “This is the first and only fully open source version of Presto that’s available on the AWS Marketplace, and we hope to see continued growth and adoption of PrestoDB as the federated query engine of choice for analytics across data lakes and databases.”

In addition to the PrestoDB AMI and container, new Presto Sandbox offerings are also available to users getting started with Presto. These include a new Sandbox AMI on the AWS Marketplace and a Sandbox container on DockerHub. They come preconfigured with the Hive Metastore catalog to allow users to query data easily from S3, MySQL and other data sources as well as with built-in data sets like the TPC-DS benchmark data.

The new AWS AMIs and container offerings are yet another way for PrestoDB users to install and get started with the software. Other options include manual installation and using Homebrew. All variations offer PrestoDB as a completely open source technology.

“Having tracked the data space for more than two decades, I’ve seen wave after wave of innovation. What Presto brings to the table is something we’ve long sought: an efficient engine for federated queries across widely disparate datasets,” noted Eric Kavanagh, CEO of The Bloor Group. “This allows organizations to retain their investments in databases, data warehouses, data lakes, and so-called lake houses, while expediting and amplifying business value. The approach Ahana is taking around making PrestoDB easier to use and providing commercial support are key pieces in helping widen the adoption of the technology.” 

Two Presto offerings, one name

Despite similar names, PrestoDB and PrestoSQL are two different offerings. While other variations of Presto are available on the marketplace like the Starburst Data AMI (based on PrestoSQL), they are paid offerings with proprietary features. The PrestoDB AMI, on the other hand, is 100% open source and available for use in production immediately. 

As the original project that came out of Facebook in 2013, PrestoDB is hosted under the auspices of the Linux Foundation’s Presto Foundation. The PrestoSQL fork was announced in 2019 and is backed by the Presto Software Foundation, led by the original creators of Presto who left Facebook.

Resources

Tweet this:  .@AhanaIO announces @prestodb now available on @awsmarketplace; Ahana will provide support #opensource #analytics #cloud https://bit.ly/2BGPudo

About Ahana

Ahana, the Presto-based analytics company, is the only company focused on unifying the PrestoDB community and bringing PrestoDB-focused ad-hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # #

Media Contact:

Beth Winkowski

Winkowski Public Relations, LLC

978-649-7189

bethwinkowski@gmail.com