Presto Guide, Documentation & Support

Presto is a fast, flexible distributed SQL engine created and used by Facebook at scale. O’Reilly’s Learning and Operating Presto ebook is an approachable guide for using Presto and getting Presto into a production deployment. It is intended to make Presto more accessible to anyone who cares about their organization becoming data-driven. It will explain how to begin Presto operations at your organization to derive insights on datasets of any type and any size, wherever they reside. 

Regardless of whether you’re a current user of Presto, new to Presto, or going to provide Presto to your users, the O’Reilly Learning and Operating Presto guidebook does an excellent job of explaining what you need to know. By reading this Presto guide you will learn what Presto is, where it came from, and how it differs from other data warehousing solutions. You’ll discover why Facebook, Uber, Twitter, and cloud providers use Presto and how you can quickly deploy Presto in production. You’ll also hear about who uses Presto and the ways they solve data challenges with it.

You can supplement what you learn in this Presto guide by visiting the Presto website, reading the Ahana blog, and participating in Presto’s growing community, which is filled with experienced users, developers, and data architects. You can join the Slack channel, engage via Github Discussions, or attend the Presto Virtual Meetups. All these avenues will help guide you through Presto.

Check if Presto map or array is empty

When working with array (indexable lists) or map (key-value tuple) complex types in Presto, it is useful to be able to test if they are empty or not. The examples below should enable you to check if a map or array is empty.

Maps

First, let’s create a map from an array of key values:

presto> select map(array['taco','donut'], array[1,2]) as myMap;
       myMap       
-------------------
 {donut=2, taco=1} 

We can use the CARDINALITY() function to check if the map contains anything:

presto> select cardinality(map(array['taco','donut'], array[1,2])) = 0 as is_empty;
 is_empty 
----------
 false    

presto> select cardinality(map(array[], array[])) = 0 as is_empty;
 is_empty 
----------
 true     

We can also compare a map with an empty map() to test if it is empty:

presto> select (map(array['taco'], array['donut']) = map()) as is_empty;
 is_empty 
----------
 false 
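
And for completeness, a quick sketch along the same lines showing that comparing an empty map with map() returns true:

presto> select (map(array[], array[]) = map()) as is_empty;
 is_empty 
----------
 true     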

Arrays

Again, we can use the CARDINALITY() function to check if an array contains anything. Here’s an example using an array with 3 elements:

presto> SELECT cardinality(ARRAY['Ahana', 'Cloud', 'Presto']) = 0 as is_empty;
is_empty 
--------
 false  


presto> SELECT cardinality(ARRAY[]) = 0 as is_empty;
is_empty 
--------
 true  

Another method to check if an array is empty is to compare it with array[], like this (here the array being tested comes from map_keys()):

presto> SELECT (map_keys(map(ARRAY['Ahana'],ARRAY['Presto']))= array[]) as is_empty;
 is_empty 
----------
 false 

Tip: Be aware of empty strings and null values. An array containing empty strings or nulls is not considered to be empty:

presto> SELECT cardinality(ARRAY['', '', '']) = 0 as is_empty;
is_empty 
--------
 false  

presto> SELECT cardinality(ARRAY[null,null,null]) = 0 as is_empty;
is_empty 
--------
 false  

The above query examples should enable you to check if a map or array is empty.

Export result of Select statement

A common question is “how can I run a query and export the result set quickly and easily in Presto?”  You are in luck, as several solutions exist – your choice will depend on your preferred tool, the output format you want, and the size of the result-set. Here are two options.

Using Presto-CLI

If you are using the Presto command line tool presto-cli (or just presto on the Mac if you used brew install presto) then use the --output-format parameter and redirect the output to a regular file. For example:

$ presto --server https://myPrestoCluster.com \
--catalog glue --schema amazon \
--execute "select product_title, count(*) as Num_Reviews from reviews where lower(product_title) like '%tacos%' group by 1 order by 2 desc limit 20" \
--output-format CSV > top20_taco_reviews.csv

$ head top20_taco_reviews.csv
"Dragons Love Tacos","832"
"Vitacost Extra Virgin Certified Organic Coconut Oil 16 Oz","281"
"Vitacost 100% Pure Peppermint Oil -- 4 fl oz","178"
"Vitacost 100% Pure Lavender Oil -- 4 fl oz","168"
"Taco Tender Holder - Plastic Red Stand - Holds 3 Tacos","106"
"Vitacost Infant Health - Baby-D's Liquid Vitamin D Drops -- 400 IU - 1 fl oz","101"
"Vitacost Butterbur Extract - Standardized -- 75 mg - 120 Capsules","84"
"Tacos, Tortas, and Tamales: Flavors from the Griddles, Pots, and Streetside Kitchens of Mexico","63"
"Vitacost Pine Bark Extract - Standardized to 95% OPC -- 100 mg - 300 Capsules","51"
"Vegan Tacos: Authentic and Inspired Recipes for Mexico's Favorite Street Food","45"

There are several formats supported by Presto-CLI, the default being quoted CSV:

--output-format <output-format>
            Output format for batch mode [ALIGNED, VERTICAL, CSV, TSV,
            CSV_HEADER, TSV_HEADER, NULL] (default: CSV)

So if you want to see column headers in your CSV output file, use --output-format CSV_HEADER.
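
For example, here’s a variation of the earlier (hypothetical) command that writes a header row followed by the data:

$ presto --server https://myPrestoCluster.com \
--catalog glue --schema amazon \
--execute "select product_title, count(*) as Num_Reviews from reviews group by 1 order by 2 desc limit 20" \
--output-format CSV_HEADER > top20_reviews.csv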

The advantage of using this approach is speed.

Using a Database Tool

If you are using a third-party SQL database tool like DbVisualizer, DBeaver or SQuirreL SQL then the UI will allow you to save the query output. For example, using DbVisualizer, run your query and click the Export… button.

The advantage of this method is the huge number of output formatting options on offer. The disadvantage is that it is usually slower than using Presto-CLI.

The two options above should help you export results of a Select statement.

Presto queries running slow?

Here are some common reasons for Presto queries running slow, along with diagnosis and tuning suggestions, things to check, and possible solutions:

  1. How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analysing, planning and optimising queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note: Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger cloud instance, as more CPU and memory could help.
  2. Have you configured Presto’s memory usage correctly? It is often necessary to change the default memory configuration based on your cluster’s capacity. The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads (see the sketch after this list).
  3. What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing higher network IO instances for the workers – on AWS, for example, you can do this by checking each instance type’s “network performance” rating.
  4. Optimize your metadata / data catalog: using Presto’s Hive connector for your metastore, as many users do, means practically every query will access the Hive metastore for table and partition details. During peak times that generates a high load on the metastore, which can slow down query performance. To alleviate this, consider:
    1. Setting up multiple catalogs. Configure PrestoDB to use multiple Thrift metastore endpoints – Presto’s Hive connector supports configuring multiple Hive metastore endpoints, which are tried in round-robin by the coordinator. See https://prestodb.io/docs/current/connector/hive.html
    2. Enabling Hive metastore caching, with the cache eviction configurations and TTLs carefully tuned to suit your data refresh policies.
  5. Do you have a separate coordinator node? With PrestoDB you can have a single node act as both a coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes, but it’s obviously not optimal in terms of performance. It is nearly always recommended to have the coordinator running on a separate node from the workers for anything other than sandbox use. Tip: check each node’s Presto etc/config.properties file to determine which one is the coordinator (look for coordinator=true).
  6. Is memory exhausted? Presto uses an in-memory, pipelining processing architecture, and its operation depends on the available JVM heap, which in turn depends on how much memory Presto is configured to use and how much memory is physically available in the server or instance it is running on. The workers can be memory hungry when processing very large queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary, and switch to a more memory-rich machine if practical. The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload and the resources available. It’s not uncommon to see the coordinator alone consuming several tens of GBs of memory. The good news is there is memory information available in at least two places:
    1. Presto’s built-in JMX catalog can help you monitor memory usage with various counters. Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking
    2. There is also the Presto Console, which reveals, for each query, the reserved, peak and cumulative memory usage.
  7. When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can resolve all sorts of issues, including memory leaks and garbage collection problems.
  8. Is your Presto cluster configured for autoscaling based on CPU usage? If so, check that the configuration is what you expect it to be.
  9. Does IO and CPU utilisation look balanced? Check CPU usage on Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads could be raised, or that the number of splits in a batch is not high enough.
  10. Have you checked your data volumes recently? An obvious one to check, but data volumes can grow in fits and starts, and sometimes peaks occur unexpectedly. The queries may simply be taking longer because there is x% more data than last month.
  11. Other configuration settings for task concurrency, initial splits per node, join strategy, driver tasks and so on: PrestoDB has around 82 system configuration settings and 50+ Hive configuration settings which users can tweak, many at the query level. These are for advanced users, however, and fall outside the scope of this article. More information can be found in the PrestoDB documentation.
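
To illustrate the memory rule of thumb in point 2, here is a minimal jvm.config sketch assuming a node with 64 GB of RAM (the flags shown are typical example values, not definitive recommendations – adjust for your own workloads):

-server
-Xmx51G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError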

Ahana Wins Best Big Data Startup in Fifth Annual Datanami Readers’ and Editors’ Choice Awards

Datanami Awards Given to Leaders in the Big Data and Data Science Communities

San Mateo, CA – October 22, 2020 – Ahana™, the self-service analytics company for Presto, announced today it has been recognized in the fifth annual Datanami Readers’ and Editors’ Choice Awards. Ahana was awarded Editors’ Choice: Best Big Data Startup. Ahana provides the first cloud-native Managed Service for Presto designed to simplify the deployment, management and integration of Presto, an open source distributed SQL query engine, with data catalogs, databases and data lakes on Amazon Web Services (AWS).

The Datanami Readers’ and Editors’ Choice Awards recognize the companies, products, and projects that have made a difference in the big data community this year. The list of this year’s winners was revealed on the Datanami website, located here: https://www.datanami.com/2020-datanami-readers-choice-awards/

“Ahana was founded to address the complexities inherent in providing open data lake analytics systems, where data platform engineers are left to operate many distributed systems, integrating components like the metadata catalog, multiple data sources, and visualization tools. Deploying and managing complex software in the cloud for this reason can be challenging,” said Steven Mih, cofounder and CEO, Ahana. “We are honored that Datanami editors recognized Ahana, our contributions to the PrestoDB open source project, and our ground-breaking cloud-native managed service to make data analytics easier than ever.”

“Years ago, we launched the Datanami Readers’ Choice Awards to give our readers a voice in honoring the best big data solutions in the market,” said Alex Woodie, Managing Editor of Datanami. “Today, as we continue this proud tradition, we are pleased to announce the winners of the fifth annual Readers’ and Editor’s Choice Awards. I want to personally thank all of the Datanami readers who participated in the voting process and congratulate all of the 2020 RCA and ECA winners.” 

More information on these awards can be found at the Datanami website (www.datanami.com) or on Twitter through the following hashtag: #DatanamiAwards.

About Datanami

Datanami is one of the premier online portals covering big data. It is produced by Tabor Communications, Inc. (TCI), a leading international media, advertising and communications company dedicated to covering advanced computing technologies for science, research and business. Receive a complimentary subscription to the weekly Datanami e-newsletter by visiting datanami.com/subscribe. View daily news at www.datanami.com.

About Ahana

Ahana, the self-service analytics company for Presto, is the only company with a cloud-native managed service for Presto on Amazon Web Services that simplifies the deployment, management and integration of Presto and enables cloud and data platform teams to provide self-service, SQL analytics for their organization’s analysts and scientists. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV, Lux Capital, and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

# # # 

Media Contact:
Beth Winkowski
Winkowski Public Relations, LLC
978-649-7189
beth@ahana.io

Extract keys in a nested JSON array object

Let’s say we have some JSON that looks like this:

[{"id": 1, "value":"xxx"}, {"id": 2, "value":"yyy"}]

In this instance our JSON contains key-value pairs. How can we decode that using PrestoDB SQL to extract only the keys (1 and 2 in this example)?

We can do this using SQL arrays, the JSON_PARSE function and the TRANSFORM function.  The TRANSFORM function’s syntax is:

transform(array(T), function(T, U)) -> array(U)

Returns an array that is the result of applying function to each element of array.

For this example query we’ll also use VALUES to provide the test data – VALUES is a handy way to define an inline table, kind of like a dummy table.

select 
TRANSFORM(CAST(JSON_PARSE(array) AS ARRAY<MAP<VARCHAR, VARCHAR>>), entry -> entry['id']) as My_Keys 
from (values ('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]')) t(array);

 My_Keys 
--------
 [1, 2]  

This works by first defining an inline table with a single VARCHAR column, called ‘array’, holding the input JSON string. JSON_PARSE then converts that string into deserialized JSON values, and we cast those values as an array of maps with VARCHAR keys and values. TRANSFORM then returns an array containing only the elements we want (each map element is named ‘entry’ in this example, but it could be called anything).

More details on arrays can be found in the PrestoDB docs: https://prestodb.io/docs/current/functions/array.html 

And details on JSON functions are documented here: https://prestodb.io/docs/current/functions/json.html 

So with Presto’s rich set of array and JSON functions we can see how to easily extract keys in a nested JSON array object.

Generate Interval from Variable Using PrestoDB

There are numerous ways of generating intervals using PrestoDB SQL for different data types.

Dates

Date intervals can be generated for a given date range using a combination of the following not-so-well-known SQL functions:

  • SEQUENCE() – generates a sequence of values between a given start and stop range. It generates an ARRAY type. Function syntax is:

sequence(start, stop, step) -> array(timestamp)

Generate a sequence of timestamps from start to stop, incrementing by step. The type of step can be INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. Using step we can choose the interval we need, e.g. INTERVAL ‘1’ DAY (a timestamp example follows this list).

Note that there is an array size upper limit of 10000.

  • FROM_ISO8601_DATE() – Parses an ISO 8601 formatted string into a date.
  • VALUES() – defines an inline / dummy table. You can use VALUES anywhere a query can be used, e.g. in the FROM clause of a SELECT, or in an INSERT.
  • UNNEST() – this function is used to expand an ARRAY into a single column. 
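
As an aside (referenced above), SEQUENCE also works directly with timestamps. A quick sketch, with arbitrary example values:

presto> SELECT SEQUENCE(timestamp '2020-06-01 00:00:00', timestamp '2020-06-01 02:00:00', INTERVAL '1' HOUR) as ts_array;
                                  ts_array                                   
-----------------------------------------------------------------------------
 [2020-06-01 00:00:00.000, 2020-06-01 01:00:00.000, 2020-06-01 02:00:00.000] 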

Example 1

Here’s an example query showing how to generate dates for the first 2 weeks of June with an interval of 1 day. We create an array called date_array holding the sequence of dates we need, then convert the array into rows using a CROSS JOIN with UNNEST.

SELECT CAST(date_column AS DATE) date_column 
FROM (
  VALUES (SEQUENCE(FROM_ISO8601_DATE('2020-06-01'),    
                   FROM_ISO8601_DATE('2020-06-14'), INTERVAL '1' DAY) ) ) AS t1(date_array) 
CROSS JOIN UNNEST(date_array) AS t2(date_column) ;

 date_column 
-------------
 2020-06-01  
 2020-06-02  
 2020-06-03  
 2020-06-04  
 2020-06-05  
 2020-06-06  
 2020-06-07  
 2020-06-08  
 2020-06-09  
 2020-06-10  
 2020-06-11  
 2020-06-12  
 2020-06-13  
 2020-06-14  
(14 rows)

Example 2

Here’s a slightly different approach, this time using the DATE_ADD function. We specify a start date in a table, and the result set begins with this given date plus the next n (10 in this example) dates. We’re still using an array, but the array from the SEQUENCE function is fed straight to the UNNEST function.

create table table1(start_date date);

insert into table1 values date('2020-10-13');

select DATE_ADD('day',s.n,t.start_date) as date from table1 t cross join UNNEST(SEQUENCE(0,10)) s (n);
    date    
------------
 2020-10-13 
 2020-10-14 
 2020-10-15 
 2020-10-16 
 2020-10-17 
 2020-10-18 
 2020-10-19 
 2020-10-20 
 2020-10-21 
 2020-10-22 
 2020-10-23 
(11 rows)

Integers

Example

Here’s a simple example generating a sequence of integers from 20 to 0 in steps of -2:

 SELECT x  FROM UNNEST(sequence(20, 0, -2)) t(x);
 x  
----
 20 
 18 
 16 
 14 
 12 
 10 
  8 
  6 
  4 
  2 
  0 
(11 rows)

These examples should help you when trying to generate an interval from a variable.

Presto Data Sources

Presto was designed and written from the ground up to efficiently query data against data sources of all sizes, ranging from gigabytes to petabytes. Presto connects to a wide variety of data sources, from HDFS to traditional relational databases, as well as NoSQL data sources such as Cassandra. Presto is particularly equipped to perform multiple concurrent interactive queries against a data source.

Presto is obviously a great fit for companies that have disparate data sources. For those organizations that can’t consolidate all of their data into one centralized store, Presto’s data federation capabilities can create a unified query layer that enables you to blend data from different data sources. With Presto, you can leverage many data sources at once, which means Presto can handle very large volumes of data.
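
To make federation concrete, here is a sketch of a single query joining data across two catalogs – the catalog, schema and table names (hive.sales.orders and mysql.crm.customers) are hypothetical:

presto> SELECT o.orderkey, c.name
     -> FROM hive.sales.orders o
     -> JOIN mysql.crm.customers c ON o.custkey = c.custkey;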

Here’s a list of the most popular data sources that Presto connects to:

  • Accumulo
  • Alluxio
  • Amazon Redshift
  • Amazon S3
  • Cassandra
  • Druid
  • Elasticsearch
  • HDFS
  • Kafka
  • Kudu
  • Microsoft SQL Server
  • MongoDB
  • Phoenix
  • Pinot
  • RDS PostgreSQL
  • RDS MySQL
  • Redis
  • Teradata

Presto Server Types

Coordinators and workers are the two types of Presto servers. Here’s a look at each server type in more detail.

A Presto coordinator parses statements, plans queries, and manages Presto worker nodes. The coordinator is also the node to which a client connects to submit statements for execution. Every Presto installation must have a Presto coordinator alongside one or more Presto workers. For development or testing purposes, a single instance of Presto can be configured to perform both roles.

The coordinator keeps track of the activity on each worker and coordinates the execution of a query. The coordinator creates a logical model of a query involving a series of stages, which are then translated into a series of connected tasks running on a cluster of Presto workers.

A Presto worker is a server in a Presto installation. Its basic job is to execute tasks assigned by the coordinator and to process data. When a worker starts up, it advertises itself to the discovery server, which makes it available to the coordinator for task execution. Workers fetch data via connectors and exchange intermediate data with each other. The coordinator takes the final results from the workers and sends them back to the client.

Workers use an HTTP-based protocol to communicate with both the coordinator and other workers.
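
To make the role split concrete, here is a minimal sketch of the relevant etc/config.properties entries for each server type (the hostname and port are example values):

A dedicated coordinator:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator-host:8080

A worker:

coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator-host:8080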

How To Escape a Single Quote With Presto

In reality there will be some occasions when you need to use a quote character in your data, in your query, or in result sets, but you want the SQL interpreter to treat the quote as “just another character”, as opposed to a quote with special meaning – namely, denoting a string literal in most SQL dialects. This is called escaping a character.

Some SQL dialects use a backslash as the escape character. Presto’s SQL needs no escape character for this; instead it uses a double-character notation. Here’s a simple example of what I mean:

presto:default> select 'hello world';
 _col0 
-------
 hello world

What if I want single quotes to actually appear in the result?  I need to do this in Presto:

presto:default> select '''hello world''' as col1;
  col1   
---------
 'hello world' 

I have simply used two single quotes to tell the interpreter the places where “I really want to include a single quote”. 

Another example: What if there are single quotes in my table? 

presto:default> select * from mytable where column_a in ('Driver's License');

Query n failed: line 1:43: mismatched input 'in'. Expecting: 'AND', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LIMIT', 'OR', 'ORDER', 'UNION', <EOF>

The query was rejected because the interpreter encountered an unexpected single quote in “Driver’s License”.  To handle this we use two single quotes:

presto:default> select * from mytable where column_a in ('Driver''s License');

(query runs ok)
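
As an aside, another option is to build the quote with the chr() function, since character code 39 is the single quote – a sketch using the same example table:

presto:default> select * from mytable where column_a in ('Driver' || chr(39) || 's License');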

So that’s how to escape a single quote in Presto.

Is Presto Free?

PrestoDB is a free, open source, federated, distributed SQL query engine used for ad hoc analytics. The PrestoDB AMI is 100% open source and available for use in production immediately; it is the first and only completely open source and completely free edition of the software.

At Ahana, we have a team of PrestoDB experts who are actively contributing to the open source project and community. Our mission is to help you be successful with your PrestoDB deployments. 

Currently you can have free Early Access to Ahana Cloud for Presto, the first fully integrated, cloud-native managed service for Presto. It is the easiest Presto experience ever, and requires no installation, no AWS AMIs or CFTs and no configuration. You can be running SQL queries on your AWS data within 30 minutes. You can sign up for free here.

Check if null and return default (NVL analog)

SQL has a number of functions for dealing with null values, and everyone has their favorite. The complicating factor is that not all SQL engines support all the functions, and some SQL dialects have their own versions of the functions which may be completely different (remember the “S” in SQL stands for Structured, not Standard!).

Let’s look at the functions used to substitute alternative, or default, values for NULLs. These functions are really useful if you need to see zeros instead of nulls in your results, or if you need to avoid division by zero errors. 

Oracle has the NVL(column_x, value_y) function which will return column_x UNLESS it is a null, in which case the value_y will be returned instead. 

In Microsoft SQL Server, the function is ISNULL(column, value)

In MySQL, you use IFNULL(column, value)

In most databases (MySQL included), you can use COALESCE(value1, value2, …) which returns the first non-null value in the list.

And in some databases, you will find the function NULLIF(value1, value2) which returns null if value1 equals value2, otherwise returns value1.

And in most databases, you can use good old CASE, for example:

SELECT a,
       CASE
           WHEN a = 999 THEN 'this is too much'
           WHEN a is null THEN 'this is null'
           ELSE 'we are good'
       END
FROM …

What about PrestoDB’s support when it comes to handling nulls?  PrestoDB  supports the following:

  • COALESCE(value1, value2, …) – returns the first non-null value (or column) in the list.
  • NULLIF(value1, value2) – returns null if value1 equals value2, otherwise returns value1.
  • CASE..WHEN..END – various forms supported; see https://prestodb.io/docs/current/functions/conditional.html for more information. Use with IS NULL to match null values.
  • IF(condition, true_value, false_value) – evaluates and returns true_value if condition is true, otherwise false_value. Use with IS NULL to match null values.

You’ll notice NVL() is not supported by PrestoDB. Use COALESCE(column_x, value_y) instead as a direct replacement, or use IF() or a CASE expression.

Some examples using PrestoDB:

--Check we have a null in our data
presto:default> select partkey, brand, container from parts_null_test where partkey=200001;
 partkey |   brand    | container 
---------+------------+-----------
  200001 | SuperBrand | NULL      


--Turn nulls into the string 'SINGLE' using coalesce
presto:default> select partkey, brand, COALESCE(container,'SINGLE') as CONTAINER from parts_null_test where partkey=200001;
 partkey |   brand    | CONTAINER 
---------+------------+-----------
  200001 | SuperBrand | SINGLE    

--Use the IF() method to turn nulls into the string 'SINGLE'
presto:default> select partkey, brand, IF(container is null,'SINGLE', container) as CONTAINER from parts_null_test LIMIT 5;
 partkey |   brand    | CONTAINER  
---------+------------+------------
  200001 | SuperBrand | SINGLE     
  100001 | Brand#35   | JUMBO CASE 
  100002 | Brand#34   | WRAP CAN   
  100003 | Brand#21   | SM BOX     
  100004 | Brand#42   | SM CASE    
 
--Use a CASE statement
presto:default> select partkey, brand, CASE WHEN container is null THEN 'SINGLE' ELSE container END as CONTAINER from parts_null_test LIMIT 5;
 partkey |   brand    | CONTAINER  
---------+------------+------------
    7024 | Brand#52   | JUMBO PACK 
    7025 | Brand#15   | LG PACK    
    7026 | Brand#55   | LG BOX     
    7023 | Brand#22   | LG DRUM    
  200001 | SuperBrand | SINGLE     

The above information should help you to check if null, and return default (NVL analog) using COALESCE(), IF() or CASE in your PrestoDB SQL queries.

Static date and timestamp in where clause

Many databases automatically convert between CHAR or VARCHAR and other types like DATE and TIMESTAMP as a convenience feature. Constants in a query are also often auto-converted. Take these example queries, which count rows where the table’s DATE column is equal to a specific date, or falls within a range of times:

Select count(*) from myTable where myDateCol = '2020-09-28';

Select count(*) from myTable where myTimeCol between '2020-09-28 14:00:00.000' and '2020-09-28 14:59:59.000';

The underlying table has columns defined as DATE and TIMESTAMP types.  But the query is actually providing date and timestamp literals in the form of strings!  Some databases will tolerate this, but Presto is stricter in this respect and will give you an error like this:

presto:sf1 > select count(*) from orders where orderdate < '2020-09-01';
Query failed: line 1:45: '<' cannot be applied to date, varchar(10)

With Presto you must either cast your data types, or – a slightly simpler way – use the date or timestamp type constructors:

$ presto --schema sf1 --catalog tpch

presto:sf1> select count(*) from orders where orderdate < date '2020-09-01';
  _col0  
---------
 1500000 
(1 row)

> select count(*) from transactions where myTimestampCol between timestamp '2020-09-01 22:00:00.000' and timestamp '2020-09-01 22:59:59.000';
  _col0  
---------
   42 
(1 row)

--Using CAST() also works:
presto:sf1> select count(*) from orders where orderdate < CAST('2020-09-01' as DATE);
  _col0  
---------
 1500000 
(1 row)

In summary, by using CAST() or the DATE/TIMESTAMP type constructors you will be able to specify a static date and timestamp in where clause.

Where can I find the Presto Server Bootstrap logs?

A common question is “where are Presto’s bootstrap logs?”  These are logs that indicate what is happening to the cluster on startup. The Presto server generates them, and the content looks like this:

2020-05-22T11:59:32.045-0400 INFO main io.airlift.log.Logging Logging to stderr
2020-05-22T11:59:32.049-0400 INFO main Bootstrap Loading configuration
2020-05-22T11:59:32.224-0400 INFO main Bootstrap Initializing logging
2020-05-22T11:59:33.268-0400 INFO main Bootstrap PROPERTY DEFAULT RUNTIME DESCRIPTION
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.uri null http://localhost:8080 Discovery service base URI
2020-05-22T11:59:33.268-0400 INFO main Bootstrap service-inventory.uri null null Service inventory base URI
2020-05-22T11:59:33.268-0400 INFO main Bootstrap service-inventory.update-interval 10.00s 10.00s Service inventory update interval
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.presto.pool general general
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.collector.pool general general
2020-05-22T11:59:33.270-0400 INFO main Bootstrap discovery.max-age 30.00s 30.00s
2020-05-22T11:59:33.270-0400 INFO main Bootstrap discovery.store-cache-ttl 1.00s 1.00s
2020-05-22T11:59:33.270-0400 INFO main Bootstrap dynamic.store.gc-interval 1.00h 1.00h
2020-05-22T11:59:33.271-0400 INFO main Bootstrap dynamic.store.remote.max-batch-size 1000 1000
...

In the above excerpt you can see Bootstrap-specific lines.

There are 3 logs of interest:

  • server.log – This is the main log file used by Presto. If the server fails during initialization, it usually contains relevant information. It will automatically rotate and compress.
  • launcher.log – This log is created by the launcher and is connected to the server’s stdout and stderr streams. It will contain some log messages that occurred while initializing server logging and any error or diagnostic information generated by the JVM.
  • http-request.log – This is the HTTP request log, which contains every HTTP request received by the server. It will automatically rotate and compress.

The log files’ location depends on where you are running your PrestoDB cluster and how you have it configured:

  • If you are running on EC2 look here: /media/ephemeral0/presto/var/log/bootstrap.log 
  • Other places to check (not just for the server log) are: /var/log/ and /var/lib/presto/data/var/log/ and /var/presto/data/

If all else fails, look in your config.properties file (look in /opt/presto-server/etc/) to see if there is anything deployment-specific configured for logging. 

Configuration tip: Presto needs a data directory for storing logs etc., and it is recommended that this is created outside of the installation directory (which allows it to be easily preserved when upgrading Presto).

  • Presto’s node properties file etc/node.properties sets the location of the data directory. 
  • In your node.properties file set the node.data-dir property to point to the location (filesystem path) of the intended data directory – Presto will store logs and other data here. 
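
Here is a minimal node.properties sketch (the values are illustrative – in particular, node.id must be unique for every node):

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data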

Finally, if you are wondering where to start when diagnosing bootstrap issues, it is worth looking at the output in the bootstrap section of the server.log file first. 

Armed with the above information you should now be able to find the Presto Server Bootstrap logs.

Where can I find different Presto metrics for monitoring?

There are several ways to monitor Presto.  Let’s look at some options.

1. Presto Console

Presto provides a web interface for monitoring and managing queries. The web interface is accessible on the Presto coordinator via HTTP, using the HTTP port number specified in the coordinator’s config properties (default is 8080). Let’s start with the console’s main page.

This main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. The currently running queries are at the top of the page, followed by the most recently completed or failed queries.  A query can have one of several states:

  • QUEUED – Query has been accepted and is awaiting execution.
  • PLANNING – Query is being planned.
  • STARTING – Query execution is being started.
  • RUNNING – Query has at least one running task.
  • BLOCKED – Query is blocked and is waiting for resources (buffer space, memory, splits, etc.).
  • FINISHING – Query is finishing (e.g. commit for autocommit queries).
  • FINISHED – Query has finished executing and all output has been consumed.
  • FAILED – Query execution failed.

Taking an example of an executed query, here’s a breakdown of the stats the console shows:

  • The query was run by the root user, using the presto-cli tool. 
  • 54 splits were completed during execution.
  • It took 5.17 seconds (wall clock time) to run. 
  • The query consumed 19.7 secs of CPU time – this is greater than the wall clock time since there are multiple CPUs and multiple cores at work.  
  • In terms of (JVM) memory, the query used up to 59.2MB during execution.  
  • The “Cumulative User Memory” (35.5M) is the sum of all (user) memory consumption across all query stages, from queryStats. The unit of this metric is memory-seconds: it represents the total user memory consumed over the query’s execution time, excluding the system’s memory consumption.

Click on the query ID link (20200925 in this example) and you will see a lot more detail.

Notice the “Live Plan” tab at top-right, which gives you a graphical representation of the query plan; you read the plan from the bottom up.

The plan is clickable – click a stage and you can drill down into more detail.

2. JMX

Java Management Extensions (JMX) provides information about the Java Virtual Machine and all of the software running inside it. 

In Presto, JMX information is exposed via a connector: chosen JMX metrics are periodically dumped and stored in tables (in the “jmx” catalog) which can be queried. This behaviour is controlled in the properties file etc/catalog/jmx.properties.
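
For illustration, a minimal etc/catalog/jmx.properties sketch (the dumped MBean list and periods are example values):

connector.name=jmx
jmx.dump-tables=java.lang:type=Runtime,com.facebook.presto.execution.scheduler:name=NodeScheduler
jmx.dump-period=10s
jmx.max-entries=86400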

JMX is useful for monitoring and debugging Presto. 

Here’s how to query it using presto-cli:

$ presto --schema jmx --catalog jmx
> select * from jmx.information_schema.tables;   -- lists tables in the information schema

> show tables from jmx.information_schema; 	    -- another way of listing tables

> select * from jmx.information_schema.views;    -- lists all views 

The most useful JMX schema for monitoring is “current” which contains every MBean from every node in the Presto cluster.  The following query uses the “current” schema to return information from the different Presto memory pools on each node:

> SELECT freebytes, node, object_name FROM jmx.current."com.facebook.presto.memory:*type=memorypool*";


 freebytes |     node     |                       object_name                        
-----------+--------------+----------------------------------------------------------
 322122547 | 4768d52a6258 | com.facebook.presto.memory:type=MemoryPool,name=reserved 
 429496730 | 4768d52a6258 | com.facebook.presto.memory:type=MemoryPool,name=general  
(2 rows)

More info on monitoring can be found in the docs – check out this blog for details on memory-related monitoring, management and memory pools: https://prestodb.io/blog/2019/08/19/memory-tracking 

3. REST API

You can make a simple REST call to Presto to get a dump of recently run queries using:

http://<prestoServerHost>:8080/v1/query

The response is in JSON format.

 You can optionally specify a query ID – in this example my query ID is 20200926_204458_00000_68x9u:

http://<prestoServerHost>:8080/v1/query/20200926_204458_00000_68x9u
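
For example, with curl (assuming a coordinator running locally on the default port):

$ curl -s http://localhost:8080/v1/query
$ curl -s http://localhost:8080/v1/query/20200926_204458_00000_68x9u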

4. Third-Party Tools

You can also monitor Presto using third-party tools like Datadog, Prometheus, etc.

The above is not an exhaustive list, but I hope it helps you find different Presto metrics for monitoring.

Presto equivalent of mysql group_concat

PrestoDB supports ANSI SQL, making it easy to group and aggregate data in a variety of ways. However, not all functions found in other SQL dialects, such as MySQL’s, are supported by PrestoDB!

Let’s look at the really useful MySQL and MariaDB SQL function GROUP_CONCAT() – this is used to concatenate data in column(s) from multiple rows into one field. It is an aggregate (GROUP BY) function which returns a String, assuming the group contains at least one non-NULL value (otherwise it returns NULL). GROUP_CONCAT() is an example of a function that is not yet supported by PrestoDB, and this is the error you’ll see if you try using it to get a list of customers that have ordered something along with their order priorities:

presto> use tpch.sf1;

presto:sf1> select custkey, GROUP_CONCAT(DISTINCT orderpriority ORDER BY orderpriority SEPARATOR ',') as OrderPriorities from orders GROUP BY custkey;

Query 20200925_105949_00013_68x9u failed: line 1:16: Function group_concat not registered

So what’s the workaround?  array_join() and array_agg() to the rescue! 

presto:sf1> select custkey,array_join(array_distinct(array_agg(orderpriority)),',') as OrderPriorities from orders group by custkey;
 custkey |                OrderPriorities                 
---------+------------------------------------------------
   69577 | 2-HIGH,1-URGENT,3-MEDIUM,5-LOW,4-NOT SPECIFIED 
   52156 | 4-NOT SPECIFIED,3-MEDIUM,1-URGENT,5-LOW,2-HIGH 
  108388 | 5-LOW,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,1-URGENT 
  111874 | 5-LOW,1-URGENT,2-HIGH,4-NOT SPECIFIED          
  108616 | 1-URGENT,5-LOW,4-NOT SPECIFIED,3-MEDIUM,2-HIGH 
(only the first 5 rows displayed) 

If you do not want to use the DISTINCT operator (you want duplicates in your result set in other words) then simply drop the array_distinct() function from your query:

presto:sf1> select custkey,array_join(array_agg(orderpriority),',') as OrderPriorities from orders group by custkey;
 custkey | OrderPriorities                             
---------+-------------------------------------------------------------------------------- 
   24499 | 5-LOW,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,2-HIGH,4-NOT SPECIFIED,3-MEDIUM,1-URGENT,2-HIGH,3-MEDIUM,1-URGENT,5-LOW,3-MEDIUM,4-NOT SPECIFIED,4-NOT SPECIFIED,4-NOT SPECIFIED,3-MEDIUM,3-MEDIUM,5-LOW,1-URGENT,1-URGENT,4-NOT SPECIFIE
   58279 | 4-NOT SPECIFIED,2-HIGH,5-LOW,1-URGENT,1-URGENT,5-LOW,5-LOW,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,5-LOW,3-MEDIUM,1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,5-LOW,5-LOW,3-MEDIUM,3-MEDIUM,1-URGENT,3-MEDIUM,2-HIGH,5-LOW
  142027 | 1-URGENT,2-HIGH,2-HIGH,1-URGENT,3-MEDIUM,1-URGENT,5-LOW,4-NOT SPECIFIED,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,2-HIGH,1-URGENT,3-MEDIUM,5-LOW,3-MEDIUM,4-NOT SPECIFIED,2-HIGH,1-URGENT,5-LOW,2-HIGH,5-LOW,1-URGENT,4-NOT SPECIFIED,2-HIG
   94169 | 1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,5-LOW,4-NOT SPECIFIED,2-HIGH,5-LOW,4-NOT SPECIFIED                                                                                                                                                                                                        
   31607 | 4-NOT SPECIFIED,2-HIGH,4-NOT SPECIFIED,2-HIGH,2-HIGH,5-LOW 

You can of course specify the separator character – in the above example I’ve used a comma.

It’s worth noting that, like PrestoDB, Microsoft SQL Server’s T-SQL lacked an equivalent of the MySQL GROUP_CONCAT() function for a long time. However, T-SQL now has the STRING_AGG() function, which is available from SQL Server 2017 onwards.

And hey, presto, you now have a working Presto equivalent of mysql group_concat.

How to get the Presto CLI Tool

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal executable. Here are the steps:

  1. Choose one of three ways to install it: 
    • If you use Docker, install the Ahana sandbox in Docker
    • If you have a Mac and use brew, simply run “brew install presto”
    • Download just the presto-cli executable JAR from https://prestodb.io/download.html (this will work on any OS that has Java installed)

Note that the first two options will install the PrestoDB server as well. All three options will give you the presto-cli executable JAR client that you can use to connect to any Presto cluster – including Ahana Cloud for Presto, of course.

  2. Next, check your Ahana Cloud console for the Presto cluster endpoint URL. You will find it in the Cluster Connection details.


  3. Now connect using that URL. Here’s an example that connects you to Ahana’s integrated querylog:
MacBook$ presto --server https://MyClusterName.app --schema public --catalog ahana_querylog
presto:public> select * from querylog;

      type      | cluster_name |           ts            | seq |    user     | query 
----------------+--------------+-------------------------+-----+-------------+------------
 queryCreated   | devtest      | 2020-09-22 13:03:32.000 |   1 | jamesmesney | show catalogs                                                                                                          
 queryCreated   | devtest      | 2020-09-22 13:03:54.000 |   3 | jamesmesney | use ahana_querylog                                                                                                     
 queryCreated   | devtest      | 2020-09-22 13:04:15.000 |   4 | jamesmesney | use ahana_hive                                                                                                         
 queryCreated   | devtest      | 2020-09-22 13:06:28.000 |   5 | jamesmesney | SHOW FUNCTIONS                                                                                                         
 queryCreated   | devtest      | 2020-09-22 13:15:13.000 |   8 | jamesmesney | show catalogs                                                                                                          
 queryCreated   | devtest      | 2020-09-22 13:15:19.000 |  10 | jamesmesney | use ahana_hive                                                                                                         
 queryCompleted | devtest      | 2020-09-22 13:15:19.000 |  11 | jamesmesney | use ahana_hive                                                                                                         
 queryCreated   | devtest      | 2020-09-22 13:15:20.000 |  13 | jamesmesney | SELECT table_name FROM information_schema.tables WHERE table_schema = 'default'                                        
...

One more example to get you started with the integrated tpch data:

MacBook$ presto --schema sf1 --catalog tpch

presto:sf1> show tables;
  Table   
----------
 customer 
 lineitem 
 nation   
 orders   
 part     
 partsupp 
 region   
 supplier 
(8 rows)

presto:sf1> select count(*) from orders where orderdate < date '2020-09-01';
  _col0  
---------
 1500000 
(1 row)

If you have a Windows machine and have downloaded the presto-cli JAR file from https://prestodb.io/download.html and are having trouble running it, try this (replace ‘nnn’ with your version):

c:\> java -jar presto-cli-0.nnn-executable.jar --server <host:port> --catalog <catalog_name> --schema <schema_name> --user <user_name> --password

Presto vs Spark With EMR Cluster

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. An EMR cluster with Spark is very different from Presto:

  • EMR is a data store. Presto on the other hand stores no data – it is a distributed SQL query engine, a federation middle tier. Presto users can query data in EMR and combine it with data from many other sources for which Presto connectors are provided, such as RDBMSs, NoSQL databases, files, object stores, Elasticsearch, etc.
  • Spark is a general-purpose cluster-computing framework that can process data in EMR. Spark core does not support SQL – for SQL support you install the Spark SQL module, which adds structured data processing capabilities. Spark is not designed for interactive or ad hoc queries, nor for federating data from multiple sources; for these, Presto is a better choice.

There are some similarities: EMR, Spark and Presto share distributed and parallel architectures, and are all designed for dealing with big data.  And PrestoDB is included in Amazon EMR release version 5.0.0 and later. 

A typical EMR deployment pattern is to run Spark jobs on an EMR cluster for very large data I/O and transformation, data processing, and machine learning applications.  EMR offers easy provisioning, auto-scaling, fault tolerance, and as you’d expect it has good integration with the AWS ecosystem like S3, DynamoDB and Redshift. An EMR cluster may be configured as “long running” or a transient cluster that auto-terminates once the processing job(s) have completed.

EMR comes with some disadvantages:

  • EMR does not offer support for Presto – users must create their own Presto metastore, configure connectors, and install and configure any tools they need.
  • EMR can be complex – if you have a database requirement, then provisioning EMR, Spark and S3 and ensuring you use the right file formats, networking, roles and security, can take much, much longer than deploying a packaged MPP database solution like Redshift. 
  • When an EMR cluster terminates, all Amazon EC2 instances in the cluster terminate, and data in the instance store and EBS volumes is no longer available and not recoverable. This means you can’t stop an EMR cluster and retain data like you can with EC2 instances (even though EMR runs on EC2 instances under the covers). The data in EMR is ephemeral, and there’s no “snapshot” option (because EMR clusters use instance-store volumes). The only workaround is to copy all your EMR data to S3 before each shutdown, and then ingest it all back into EMR on start-up. Users must develop a strategy to manage and preserve their data by writing to Amazon S3, and manage the cost implications.
  • On its own EMR doesn’t include any tools – no analytical tools, BI, visualisation, SQL Lab or notebooks. No HBase or Flume. No HDFS access CLI, even. So you have to roll your own by doing the tool integrations yourself and dealing with the configuration and debugging effort that entails. That can be a lot of work.
  • EMR has no UI to track jobs in real time like you can with Presto, Cloudera, Spark, and most other frameworks. Similarly EMR has no scheduler.
  • EMR has no interface for workbooks and code snippets in the cluster – this increases the complexity and time taken to develop, test and submit tasks, as all jobs have to go through a submitting process. 
  • EMR is unable to automatically replace unhealthy nodes.
  • The clue is in the name – EMR – it uses the MapReduce execution framework which is designed for large batch processing and not ad hoc, interactive processing such as analytical queries. 
  • Cost: EMR is usually more expensive than using EC2, installing Hadoop and running an always-on cluster. Persisting your EMR data in S3 adds to the cost.

When it comes to comparing an EMR cluster with Spark vs Presto technologies your choice ultimately boils down to the use cases you are trying to solve. 

Spark SQL vs Presto

When it comes to comparing Spark SQL vs Presto there are some differences to be aware of: 

Commonalities:

  • They are both open source, “big data” software frameworks
  • They are distributed, parallel, and in-memory
  • BI tools connect to them using JDBC/ODBC
  • Both have been tested and deployed at petabyte-scale companies
  • They can be run on-prem or in the cloud. They can be containerized

Differences:

  • Presto is an ANSI SQL:2003 query engine for accessing and unifying data from many different data sources.  It’s deployed as a middle-layer for federation.
  • Spark is a general-purpose cluster-computing framework. Core Spark does not support SQL – for SQL support you install the Spark SQL module which adds structured data processing capabilities. Spark SQL is also ANSI SQL:2003 compliant (since Spark 2.0).
  • Presto is more commonly used to support interactive SQL queries.  Queries are usually analytical but can perform SQL-based ETL.  
  • Spark is more general in its applications, often used for data transformation and Machine Learning workloads. 
  • Presto supports querying data in object stores like S3 by default, and has many connectors available. It also works really well with Parquet and Orc format data.
  • Spark must use Hadoop file APIs to access S3 (or pay for Databricks features). Spark has limited connectors for data sources. 

That’s our take on Spark SQL vs Presto and we hope you found it useful! 

Presto Catalogs

Presto has several important components that allow you to easily manage data. These components are catalogs, tables and schemas, and connectors. Presto accesses data via connectors; each data source is configured as a catalog, and you can query as many catalogs as you want in each query. The catalogs contain schemas and information about where data is located. Every Presto catalog is associated with a specific connector. Keep in mind that more than one catalog can use the same connector to access different instances of the same data source. 

Catalogs are defined in properties files stored in the Presto configuration directory. Schemas are what you use to organize your tables. Catalogs and schemas together define what can be queried. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog.

If you run a SQL statement in Presto, you are running it against one or more catalogs. For example, you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples include a Hive catalog connecting to a Hive data source.

You can have as many catalogs as you need. For example, if you have additional Hive clusters, you simply add an additional properties file to etc/catalog with a different name, making sure it ends in .properties. If you name the property file sales.properties, Presto creates a catalog named sales using the configured connector.
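
For illustration, here is a minimal sketch of what etc/catalog/sales.properties might contain for a Hive-backed catalog (the metastore host is an assumption):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083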

Where to Find Presto Source Code, and How to Work With it

The main branch: PrestoDB source code

Presto is an open source project that is developed in the open on the public Github repository: https://github.com/prestodb/presto. The prestodb repo is the original, master repo: Presto was first developed at Facebook, and Facebook subsequently open sourced the code base on Github in 2013 under the Apache 2.0 license, a permissive license which lets anyone download and use the code.

Whether you call it master, trunk, upstream, or mainline, those all refer to the prestodb repo: https://github.com/prestodb/presto. That means it is the single, shared, current state of the software project. Whenever you wish to start a new piece of work, you would pull code from this origin into your local repository and begin coding. The master repo is the single shared codeline, and represents the central repo and single point of record for the project. Most Presto clones or forks come from prestodb.

It is worth noting that the Facebook team continues to develop and run the prestodb project in production at Facebook at scale, so the community benefits from the development and testing done by Facebook and the other companies that run Presto. There are numerous other contributors of course, many of them working within organisations that have deployed PrestoDB at scale.

What about other forks?

As an open source project, community members can download the source code and work on it in their own public or private repos. Those members can decide to contribute changes back through the traditional Github development process of pull requests, reviews, and commits. If you are starting anew, it is generally recommended to always pull source code from the master, mainline prestodb open source repo: https://github.com/prestodb/presto

Some members may decide not to contribute back, and that Presto version then becomes a fork of the code. There are always a number of forks out in the community, and as time goes on their development tends to diverge from the original codeline. A fork misses out on the upstream changes and testing that companies like Facebook and others do, unless the fork is merged back with upstream.

(Diagram: branching patterns. Source: https://martinfowler.com/articles/branching-patterns.html)

What about PrestoSQL source code?

PrestoSQL is a fork of the original Presto project. It has a separate github repository here: https://github.com/prestosql/presto

PrestoDB is the main project of the Linux Foundation’s Presto Foundation. That Foundation has a wide-ranging set of industry members including Facebook, Uber, Twitter and Alibaba. In addition, Starburst is another industry member that has joined. As such, PrestoDB is the project repo of Presto both today and in the future.

How can I work with the Presto source code?

The easiest way is through your own Github account. If you don’t yet have a Github account, it is free and easy to sign up. Then go to https://github.com/prestodb/presto and clone the repository into your own account. If you’d like to work off your laptop, you can use a wide variety of free tools, like Atom or Github Desktop. Now you can get coding and compiling! Feel free to join the PrestoDB Slack channel to ask questions of the community (or answer questions too!). There is also the Presto Users Google group at https://groups.google.com/g/presto-users and Presto developers are active on Stack Overflow: https://stackoverflow.com/questions/tagged/presto

How can I contribute back to the PrestoDB source?

When you’re ready to contribute your code back to the community, you can use Github to raise a pull request and get your code reviewed. You’ll need to e-sign a Contributor License Agreement (CLA) that is part of all Linux Foundation projects.

Note that there are other ways to contribute back to the community. The code base is wide and varied; there are opportunities to write or improve Presto connectors, or other parts of the Presto SQL engine. In addition to writing code, you could write documentation.

Wrapping Up

Now you know how to find the Presto source code and understand the different forks that are out there. Hope to see you in the community!

Spark Streaming Alternatives

When researching Spark Streaming alternatives, the right choice really depends on your use case. Are you processing streaming data or batch data? Do you prefer an open source or closed source/proprietary alternative? Do you need SQL support?

With that in mind let’s look at ten closed-source alternatives to Spark Streaming first:

  1. Amazon Kinesis – Collect, process, and analyze real-time, streaming data such as video, audio, application logs, website clickstreams, and IoT telemetry. See also Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  2. Google Cloud Dataflow – a fully-managed service for transforming and enriching streaming and batch data.
  3. Confluent – The leading streaming data platform. Built on Apache Kafka. 
  4. Aiven for Apache Kafka – A fully managed streaming platform, deployable in the cloud of your choice.
  5. IBM Event Streams – A high-throughput, fault-tolerant, event streaming platform. Built on Kafka.
  6. Striim – A streaming data integration and operational intelligence platform designed to enable continuous query processing and streaming analytics.
  7. Spring Cloud Data Flow – Tools to create complex topologies for streaming and batch data pipelines. Features graphical stream visualizations.
  8. Lenses – The data streaming platform that simplifies your streams with Kafka and Kubernetes.
  9. StreamSets – Brings continuous data to every part of your business, delivering speed, flexibility, resilience and reliability to analytics.
  10. Solace – A complete event streaming and management platform for the real-time enterprise. 

And here are five open source alternatives to Spark Streaming:

  1. Apache Flink – Considered one of the best Apache Spark alternatives, Apache Flink is an open source platform for stream and batch processing at scale. It provides a fault-tolerant, operator-based model for streaming computation rather than the micro-batch model of Spark Streaming.
  2. Apache Beam – A unified programming model for batch and streaming data processing jobs. Pipelines written with Beam can run on multiple execution engines.
  3. Apache Apex – Enterprise-grade unified stream and batch processing engine.
  4. Apache Samza – A distributed stream processing framework.
  5. Apache Storm – A distributed realtime computation system.

So there you have it. Hopefully you can now find a suitable alternative to Spark Streaming.

Presto Concepts

What is Presto?

PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives across many different data sources such as HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources or infrastructure components.

As more organizations become data-driven, they need technologies like Presto to deliver ad-hoc analytics. Federated query engines like Presto simplify and unify data analytics on data anywhere. 

Is Presto a database?

No, Presto is not a database. You can’t store data in Presto and it would not replace a general-purpose relational database like MySQL, Oracle, or PostgreSQL.

What is the difference between PrestoDB and other forks?

Presto originated from Facebook and was built specifically for Facebook. PrestoDB is backed by Linux Foundation’s Presto Foundation and is the original Facebook open source project. 

Other versions of Presto are forks of the project and are not backed by the Linux Foundation’s Presto Foundation.

Is Presto In-Memory? 

Memory used by Presto is usually discussed in the context of the JVMs themselves; depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVMs. Presto itself, however, doesn’t use this memory to cache any data.
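To make that concrete, here is a minimal sketch of where those allocations live in a standard Presto installation (the sizes below are illustrative placeholders, not recommendations). The JVM heap for each Presto server is set in etc/jvm.config:

-server
-Xmx16G
-XX:+UseG1GC

and per-query memory limits are set in etc/config.properties:

query.max-memory=20GB
query.max-memory-per-node=8GB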

How does Presto cache and store data?

Presto stores intermediate data in its buffer cache while tasks are executing. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that allows you to query other disparate data sources.

What is the Presto query execution model?

The Presto query execution model is made up of a hierarchy of concepts: Statement, Query, Stage, Task, and Split. After you issue a SQL statement to the query engine, it parses the statement and converts it to a query. When Presto executes the query, it does so by breaking it up into multiple stages. Stages are then split up into tasks across the multiple Presto workers; the tasks are what actually do the work of processing the data. Tasks use an Exchange to share data between tasks and the outputs of processes.
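You can see this decomposition for any query with EXPLAIN. As a sketch, assuming the built-in TPC-H connector is configured as a catalog named tpch:

presto> EXPLAIN (TYPE DISTRIBUTED)
     -> SELECT returnflag, count(*) FROM tpch.tiny.lineitem GROUP BY returnflag;

The distributed plan prints one fragment per stage (Fragment 0, Fragment 1, and so on); at execution time each fragment is split into tasks that run on the workers.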

Does Presto Use MapReduce?

Similar to Hive’s execution model that breaks down a query through MapReduce to work on constituent data in HDFS, Presto will leverage its own mechanism to break down and fan out the work of a given query. It does not rely on MapReduce to do so.

What Is Presto In Big Data?

Big data encompasses many different things, including: 

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Technologies in the big data space are used to analyze, extract insights from, and manage data sets that are too large or complex for traditional data processing application software.

Presto queries data. Competitors in the space include technologies like Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL. Many of the technologies in the querying vertical of big data are designed within, or to work directly against, the Hadoop ecosystem.

What Is Presto Hive? 

Presto Hive typically refers to using Presto with the Hive connector. The connector enables you to query data that’s stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and are typically stored in an HDFS or S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This metadata is stored in a database such as MySQL and accessed via the Hive metastore service. Presto, via the Hive connector, is able to access both of these components.
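As a minimal sketch, the connector is enabled by a catalog file such as etc/catalog/hive.properties (the metastore host below is a placeholder):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083

With that catalog in place, Hive tables become queryable from Presto as hive.<schema>.<table>.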

One thing to note is that Hive also has its own query execution engine, so there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.

Does Presto Use Spark?

Presto and Spark are two different query engines. At a high level, Spark supports complex, long-running queries while Presto is better for short interactive queries. This article provides a good high-level overview comparing the two engines.

Does Presto Use YARN?

Presto is not dependent on YARN as a resource manager. Instead it uses a similar MPP-style architecture, with dedicated coordinator and worker nodes that do not depend on a Hadoop infrastructure in order to run.
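As a sketch of that standalone layout (the host name and port are placeholders), the coordinator and each worker are wired together with small etc/config.properties files. On the coordinator:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator-host:8080

And on each worker:

coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator-host:8080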


Women in Big Data Meetup:

An Introduction to Presto, an Open Source Distributed SQL Engine

with Dipti Borkar, Chief Product Officer at Ahana

In this presentation, Dipti will introduce the Presto technology and share why it’s becoming so popular – in fact, companies like Facebook, Uber, Twitter, Alibaba and many more use Presto for interactive ad hoc queries, reporting & dashboarding, data lake analytics, and much more, including job opportunities. Read more about the event at Women in Big Data.

Speaker

Dipti Borkar
Chief Product Officer, Ahana

Tech Talk Series: Getting Started with Presto

Level 103 for Presto: Deep Dive into the PrestoDB architecture at Twitter

Video & Slides

Video
Slides

Speaker

Chunxu Tang
Software Engineer, Twitter

Tech Talk Series: Getting Started with Presto

Level 102 for Presto: Getting Started with PrestoDB

Video & Slides

Video
Slides

Speakers

Da Cheng
Software Engineer, Twitter

Dipti Borkar
Co-Founder and Chief Product Officer, Ahana

Querying AWS Presto S3 data

While users may think of S3 buckets as a data source for the Presto query engine, from a Presto perspective, file systems and object stores are handled very differently than databases. Presto only needs access to the path for the data it needs to scan. It does not need to know details like the bucket name as part of the configuration; it gets the path from the catalog. The catalog maps the file system / objects to databases and tables. There are two popular catalogs for object stores: the Hive Metastore (cloud-agnostic) and AWS Glue (AWS-specific). The Presto Hive Connector is used to access the data from the object store underneath.

https://prestodb.io/docs/current/connector/hive.html

There are two primary ways for Presto clusters to get access to data stored in S3:

  1. IAM role via the instance Presto servers are running on 
  2. Access Key / Secret Key provided via the Hive connector properties file

IAM role – recommended approach

If using an IAM role, Presto needs to be configured with:

hive.s3.use-instance-credentials=true 

With this setting, the Presto server will have access to all the buckets that are accessible using the IAM role that the instance is bound to. The Hive Metastore also needs to have access to those buckets and needs to be bound to the same IAM role.

Access Key / Secret Key 

If an access key / secret key (AK/SK) pair is used, the Presto Hive connector needs to be configured with:

hive.s3.aws-access-key – Default AWS access key to use.
hive.s3.aws-secret-key – Default AWS secret key to use.

With this setting, the Presto server will have access to all the buckets that are accessible using this AK/SK when this catalog is selected. 
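Putting the pieces together, a catalog file such as etc/catalog/hive.properties might look like the sketch below (the metastore host and key values are placeholders; prefer the IAM-role approach where possible, since it avoids embedding credentials in files):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
hive.s3.aws-access-key=<access-key>
hive.s3.aws-secret-key=<secret-key>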

Getting Started with Presto on AWS and Docker: Video Demo Series

We’ve put together some video demos on getting started with PrestoDB in Docker and AWS.

We hope you find them helpful!


AWS Sandbox AMI for Presto

15 minutes

See the step-by-step tutorial >
Subscribe to the AWS Sandbox AMI for Presto by Ahana >


Sandbox Container for Presto on Docker

8 minutes

Go to Docker Sandbox >


Tech Talk Series: Getting Started with Presto

Level 101 for Presto: What is PrestoDB?

Video & Slides

Video
Slides

Speakers

Beinan Wang
Sr. Software Engineer, Twitter

Dipti Borkar
Co-Founder and Chief Product Officer, Ahana

On-Demand Virtual Round Table

Building the Presto Open Source Community

Q&A from the Panel

How’s Presto used at your company?

Uber: Presto is very widely used. It powers analytics on our data lake. And by last count, I think about half the company logs into Presto at least once a month to do something that’s important for their work. The power of this system comes out in some use cases that we have at Uber. So for instance, we have Pinot as a real-time analytics tool. And we have Hadoop as our big data lake where we store our historical data. And through, for instance, Apache Superset, we are able to build a single dashboard, which allows our users to look at how the real-time business is doing as well as what the historical trends are in a single place using a single interface, all powered by Presto. This highlights the incredible power that Presto is able to bring to data analytics.

Alibaba: Alibaba is the largest cloud service provider in China and in the Asia market. From the very beginning of Alibaba Cloud, we have embraced and contributed back to the open source community to build up a lot of our services on cloud – for example, databases and big data analytics are all services on cloud. As for Presto, we leverage Presto in several of our key products on the cloud, for example our Data Lake Analytics, which provides serverless federated analytics. We have been in a Big Data era, but now we think we are moving to a Fast Data era, which requires a lot more effort to do analytics against different data sources in a real-time manner.

Why is Presto so important and how does it help today’s user?

Bloor Group: The concept of a data lake came around in the big data evolution. And now what you see is you will have companies that have a data warehouse or two or five, they’ll have a data lake or two or three, they’ll have lots of different information systems that they’re looking to leverage, but it no longer makes sense. You can’t pull all your data into a warehouse. You can’t even realistically pull all your data into a data lake and then expect good performance. So what’s been happening is Presto is this federated query engine that allows you to access all these different information sources where they are, and get that truly strategic view of what’s going on in your business.

Ahana: Data engineers spend a lot of time building pipelines to copy data from one source to another, making it accessible from one source to another, whether it’s from their operational system to S3, or a data warehouse, etc. 60% of their time is spent moving data around. And what Presto does is it beautifully solves this problem by being able to query in place and pushing some of the core query logic, if you will – we call it push downs – to the data sources, pull that data in memory, process it, and return it back to the analyst or the technical product manager or the data-driven person at the other end trying to make a decision.

In place querying and the ability to federate across many data sources and join across them is a very hard problem to solve, and Presto solves it very well because it’s based on ANSI SQL. That’s one of the reasons it’s become so popular: you don’t have to relearn or re-integrate existing products that are out there; it already comes with standard JDBC/ODBC drivers, and it just works out of the box. In terms of the architecture, you get access to all these data sources and the processing happens in memory, as opposed to some of the previous-generation technologies that were more disk-based. That’s the other aspect that’s important: it’s an in-memory system, with the ability to spill to disk for larger data sets. And the speed is important.

Uber: Presto abstracts away the user interface that analysts and data scientists use to understand the data using tools like Superset, Jupyter, Tableau, etc. It gives a single point of contact for our users using these front end interfaces to query data on all these systems. But even more important than a single point of contact, I think is a single abstraction layer. And by that what I mean is that you can query data on all these data sources using a single query language, which is Presto, and a single view of the data, which is basically a set of definitions within Presto, where you can take data from, let’s say, Hadoop, HDFS and Pinot and join them as if they were sitting in a single database, right. And Presto makes this all possible through the abstraction layers that it builds.

How can a developer get involved in the Presto project and community?

Uber: Through the website and the GitHub repository. You can download the code, it’s free; you can take a look at the code, it’s open source. And you can make modifications and contributions to the code by submitting a pull request over GitHub – you can just put them in, get them reviewed by a committer, and make that contribution. And in the next release, your changes are there and you can just use them, and they’re available for everybody else to use. The GitHub repository and the community process are the primary means by which the vast majority of developers can engage.

Now if you want to get more involved – for instance, in the technical direction – maybe you have some opinions or some use cases where deeper changes are required in the engine itself. Get involved with the technical steering committee – the meetings are open to the public. As you build your engagement with the community and your contributions to the code, you can even earn committership, where you’re recognized as an engineering leader in the community. And you can also be part of the technical steering committee and help drive this direction if this is something that you want to do and engage with on a more full-time basis. In addition to these two venues, we also organize a lot of community events and engagements and seminars and talks, which are great ways to come and interact directly with other folks in the community.

Panelists

Girish Baliga
Engineering Manager – Data Infrastructure

Amit Chopra
Developer Advocate

Dipti Borkar
Co-Founder and Chief Product Officer

Yu Dong
Sr. Staff Engineer

Getting Perspective on Open Source Projects and Forks with Ted Dunning, Apache Software Expert

We spent some time with Apache Software Foundation board member, PMC member, and open source expert Ted Dunning to get his thoughts on the state of open source software. 

Can you start by telling us a bit about yourself and your open source background?

I’ve been involved with machine learning and big data systems for a very long time and have spent even longer working in the open source world. The first open source I was involved with was in the mid-70’s, which at the time was guys trading floppy disks in a coffee shop parking lot in Colorado. More recently I’ve been with the Apache Software Foundation (ASF) since 2007. 

What’s your general philosophy on open source and the community behind an OSS project?

In my view, it’s not just one part or a few individuals that create an open source community – it’s the community as a whole. Sure, you might have a few more vocal folks, but an open source community is only as good as its least vocal member. And the community is a great thing because it gives people the ability to be part of a community – a wonderfully electrifying experience. It’s a unifying experience, and it is the ultimate marketing. 

There are cases where people pretend to be open but won’t really let anyone else play. If a community is done cynically, it can turn into vast bitterness. And it’s very common for companies to say “we employ the creators of this project”, but in fact I think that it is the community in aggregate who wrote it. Yes, a few people may have written the first lines of code and may have gotten the ball rolling, but we have to remember that the project is not just the source code. There are always a lot of different contributions to the creation of a project.

And you’ve been a part of several open source companies, MapR being the latest as the CTO. How did you strike the balance of open source and proprietary?

MapR was actually very, very interesting because it had open and closed parts. But they were good about finding a balance – it wasn’t just all proprietary software or all open source. I tried to be very publicly clear about our decisions about open sourcing some features and keeping others proprietary. One of the challenges of a vendor-backed open source technology is that you’re going to have a culture gap inside the company. You have people come from closed source, and you have people that come from open source, and neither side necessarily understands the other. But despite that, I felt much better being straightforward and honest about our decisions rather than pretending that everything was open.

When it comes to building open source communities around projects, I’m sure you’ve seen many times that things can go wrong. In some cases that might mean a project splits off, or forks, from the original project. In your experience, what can that mean for a community?

It’s so sad when communities split up. When these destructive forks happen, quite frankly it’s a huge tragedy for open source. The common characteristics of these cases are that a) everyone is just confused and b) there typically are major differences in worldviews of the same events on either side of the fork, making it very hard to get people together when there’s no shared understanding of what was happening. I see these two specific things happen when it comes to bad forks:

  1. Both projects keep the same name or too similar of a name, so no one knows the differences between the two
  2. Splitting the project splits the community, and this may render both irrelevant

If you split a community, a mailing list, a Slack channel, there’s a reverse-synergy effect. A lot of good people who want to contribute to something but don’t really like conflict end up just leaving. People have plenty of distractions in their lives. And they have plenty of other opportunities to be creative, fun, and happy. Then like 10 years later they go, “Oh, I remember that, wonder what happened to it?”

What’s a specific example of a fork that went wrong? 

Sun’s Open Office/LibreOffice project split was a classic fork that split the community. As Sun’s contributions to the project declined, there were concerns from the Open Office community over whether it would remain under a neutral foundation. When Oracle acquired Sun, discussions of a fork began and as a result, LibreOffice was created and the community was split between the two projects. Unfortunately, it created a lot of conflict and negativity which harmed both products and the Open Office/LibreOffice community at large. Much of the community stopped contributing to or maintaining either.

The Hudson/Jenkins fork is a fork that had a similar potential for problems, but essentially the entire community left the original Hudson branch and moved over to the Jenkins fork. Today, it is hard to see that there ever was a fork. Things that went right included distinctive naming (Hudson versus Jenkins) and a strong consensus about which fork to stay with moving forward (Jenkins). The original Hudson project moved to the Eclipse foundation, but has been dormant since 2017, leaving room for Jenkins to flourish.

What about projects that keep too similar of a name? We’re seeing that play out today with the Presto open source project with the original PrestoDB and the similarly-named fork PrestoSQL. What are your thoughts on this?

My initial reaction with Presto is that I always have to go look up which one is which. There’s PrestoSQL, PrestoDB, Presto Foundation, Presto Software Foundation and I can’t keep them straight. And so there’s massive confusion unless you’re one of the participants – as an outsider, I have no idea which one is which. From what I understand about the project, my guess is that the Presto creators who worked at Facebook didn’t make it apparent that they were going off to start a new community. They may not have been as open as they could have been about this, as sometimes happens when people leave a company. Later Facebook decided to put more effort behind the community, and it caused confusion because there’s now effectively two communities with nearly identical names.

But I think that the moral of the story is that when you name something, names matter. Anybody who forks should have a completely different name. That alone makes it easier for everyone to understand. Of course, it isn’t always clear which is the fork and which is the original, but at least you’d know if it was Presto in one place and Abracadabra in the other. Everybody would get it. It doesn’t even matter all that much which side gets which name.

Ahana Raises $2.25 Million Seed Funding Led by GV and Joins Linux Foundation’s Presto Foundation

Ahana to focus on growing the Presto community to simplify ad-hoc analytics with PrestoDB federated SQL engine

San Mateo, CA — June 2, 2020 — Ahana, the Presto-based analytics company, announced today it has raised $2.25 million in funding led by GV (formerly Google Ventures), along with participation from Leslie Ventures and other angel investors. Additionally, Ahana has joined the Linux Foundation’s Presto Foundation. Ahana will use the funding to expand its technical team and deliver Presto-based analytics products. Founded during the coronavirus pandemic, Ahana is a remote-from-the-start company.

Co-founded by Steven Mih (formerly Couchbase, Aviatrix, Alluxio) and Dipti Borkar (formerly IBM, Couchbase, Alluxio), Ahana brings together decades of cloud, open source, database and distributed systems experience to be the only commercial company focused on PrestoDB, the project hosted by the Linux Foundation’s Presto Foundation. Presto has become one of the fastest growing open source projects in the data analytics space since Facebook open sourced the technology in 2013.

“The data analytics market is evolving, as enterprises look for new opportunities to manage analytical data cost-effectively and at scale. There is a pronounced movement away from expensive, highly-structured data warehouses to ‘data lakes’ of raw data,” said Dave Munichiello, General Partner, GV. “We see an opportunity in companies who power simple and efficient analysis of this type of raw data and are drawn to Ahana’s vision that leverages the rapidly-growing Presto open source project.” 

In 2019, Facebook donated the Presto project to the Linux Foundation to create the Presto Foundation along with founding members Uber, Twitter and Alibaba. Ahana has joined the Linux Foundation’s Presto Foundation as a premier member. Ahana plans to improve ease of use and documentation and to foster growth of the community of users of the PrestoDB GitHub repository. PrestoDB is the most popular Presto project based on GitHub stars as compared to forks of the project, like PrestoSQL. Only PrestoDB follows the Linux Foundation’s widely accepted approach to project governance and transparency.

“As two of our earliest members, Steven and Dipti have been strong supporters of the Presto Foundation since its launch in September 2019,” said Brian Hsieh, Head of Uber Open Source and Chairperson, Presto Foundation Governing Board. “It is our honor to welcome their new company, Ahana, to the Presto Foundation. We look forward to working together to advance our shared mission of building a collaborative, neutrally governed and open community to foster the growth of the PrestoDB ecosystem for users worldwide.” 

Today’s data-driven enterprises require technologies like Presto to enable faster insights into data. Presto is a widely adopted federated SQL engine for data platform teams to perform ad-hoc querying of data in place. As the Presto market continues to grow, Ahana’s mission is to simplify ad-hoc analytics.

“Federated query engines simplify and unify data analytics on data anywhere. But of all the engines, Presto is the fastest for ad-hoc querying of data in-place and is highly extensible with its pluggable API,” said Dipti Borkar, Co-founder and Chief Product Officer, Ahana. “With data spread across many sources and in different formats, as well as the proliferation of a disaggregated compute and storage stack, the sacred data warehouse star schema is no longer in one place. Thus, a federated query engine will become the foundation of architecture to meet the needs of modern data teams.” 

Supporting Quotes

“At Facebook, we believe open source accelerates the pace of innovation. By sharing our code, our stack, our designs, we’re hoping to provide the foundation for companies and individuals alike to scale and build great products. On the Facebook open source team, we sometimes spin-out projects to the community so that they can grow under a neutral governance structure, with Presto being the latest example of this. We’re thrilled to see that Ahana and other companies have the opportunity to use PrestoDB to build great products and businesses,” said Amit Chopra, Facebook’s representative on the Presto Foundation Board.

“The Linux Foundation is dedicated to building sustainable, governed, open source ecosystems to greatly benefit users globally, like what we have achieved with CNCF and the Kubernetes project. Presto Foundation was created to accelerate and foster the adoption of Presto, the original project donated by Facebook,” said Chris Aniszczyk, Vice President of Developer Relations, The Linux Foundation. “We are excited to welcome Ahana to the Presto Foundation. As the first and only company focused on supporting Presto of the Presto Foundation, we look forward to their continued contributions in building a thriving Presto community.” 

“There is no doubt that the future of analytics is pre-ordained by the petabyte-scale data companies like Facebook, Uber, and Twitter, who all run large clusters of PrestoDB. As the first PrestoDB company focused on simplifying ad-hoc analytics, Ahana is poised to corner this new gigantic market,” said Mark Leslie, Managing Director of Leslie Ventures, and Ahana Advisor.  

“PrestoDB is exploding in popularity with users around the world. That, combined with Ahana’s vision of simplifying the ad-hoc analytics journey for users, is extremely promising,” said Bob Wiederhold, former Couchbase CEO and Ahana Advisor. “I look forward to working closely with the co-founders Steven Mih and Dipti Borkar as they rejoin the Presto Foundation and community. While various forks of the original PrestoDB project exist, only PrestoDB has true governance, transparency and vendor neutrality authenticated by the Presto Foundation, under the Linux Foundation.” 

About Ahana

Ahana, the Presto-based analytics company, is the only company focused on unifying the PrestoDB community and bringing PrestoDB-focused ad-hoc analytics offerings to market. As the Presto market continues to grow exponentially, Ahana’s mission is to simplify interactive analytics as well as foster growth and evangelize the PrestoDB community. Founded in 2020, Ahana is headquartered in San Mateo, CA and operates as an all-remote company. Investors include GV and Leslie Ventures. Follow Ahana on LinkedIn, Twitter and PrestoDB Slack.

Using Presto with Hadoop

How does Presto work with Hadoop?

You use Presto to run interactive queries on Hadoop. The difference between using Presto versus something like Hive, for instance, is that Presto is optimized for fast performance – this is crucial for interactive queries.

Presto’s distributed system runs on Hadoop and uses an architecture that’s similar to a massively parallel processing (MPP) database management system. 

Presto does not have its own storage system, so it is complementary to Hadoop. It can be installed with any implementation of Hadoop, like Amazon’s EMR Hadoop distribution.

Is Presto related to Hive? What are the differences?

Apache Hive was developed as a project at Facebook in 2008 so they could leverage SQL syntax in their Hadoop system. It simplifies complex Java MapReduce jobs into SQL-like queries while executing jobs at massive scale.

As their Hadoop deployments grew, the Facebook team found that Hive wasn’t optimized for the fast performance they needed for their interactive queries. So they built Presto in 2013, which could operate quickly at petabyte scale.

At a high level, Hive is optimized for batch processing and ease of use with its SQL-like syntax, while Presto is highly optimized for low latency and fast performance – the same job typically takes longer to complete in Hive than in Presto.
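As an illustrative sketch (the server address, schema, and table name here are placeholders), the same Hive-defined table can be queried interactively through the Presto CLI:

presto --server coordinator-host:8080 --catalog hive --schema default \
       --execute "SELECT count(*) FROM orders"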

Presto on AWS

Presto and AWS

Presto is an open-source distributed SQL query engine for running interactive analytic queries against all types of data sources. It enables self-service ad-hoc analytics on large amounts of data. With Presto, you can query data where it lives across many different data sources.

If you want to run Presto in AWS, it’s easy to spin up a managed Presto cluster either via the AWS Management Console, the AWS CLI, or the Amazon EMR API. 
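For example, a minimal AWS CLI invocation might look like the sketch below (the release label, instance type, and instance count are illustrative; check the EMR documentation for current values):

aws emr create-cluster \
  --name "Presto cluster" \
  --release-label emr-6.2.0 \
  --applications Name=Presto \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles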

Running Presto in AWS gives you the flexibility, scalability, performance, and cost-effective features of the cloud while allowing you to take advantage of Presto’s distributed query engine. 

How does Presto work with AWS?

The two most popular places to deploy Presto in AWS are Amazon EMR and Amazon Athena. They’re managed services that do the integration, testing, setup, configuration, and cluster tuning for you.

Amazon EMR enables you to provision as many compute instances as you want, in minutes. Amazon Athena lets you deploy Presto using the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune.
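With Athena there is no cluster to manage; queries are submitted as jobs. A sketch with the AWS CLI (the database, table, and results bucket below are placeholders):

aws athena start-query-execution \
  --query-string "SELECT count(*) FROM mydb.orders" \
  --result-configuration OutputLocation=s3://your-bucket/athena-results/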
