Presto and ETL – Clarifying Common Misconceptions

Data integration and ETL are typically seen as crucial to gaining insights from data. But when it comes to Presto, things get trickier: Should you use Presto for ETL? Or should you run ETL flows before querying data with Presto?

In this article, we will explore using Presto for ETL, how it compares to traditional data warehouse architectures, and whether ETL is necessary before querying data with Presto.

Looking for a better solution to run Presto in public clouds? Try Ahana for superior price-performance.

What is ETL in big data?

ETL (Extract, Transform, Load) is an iterative data integration process used in data warehousing to bring data from multiple sources into a single, centralized data repository. The process involves extracting data from its original source, cleaning and validating it, transforming it into a suitable format, and loading it into the data warehouse.

This process is necessary because data from different sources may have different formats and structures, so it needs to be unified and organized before it can be used in data analysis and decision-making. ETL also allows data to be updated regularly, so that the data warehouse is always up-to-date. Additionally, ETL tools allow data to be migrated between a variety of sources, destinations, and analysis tools, and enable companies to address specific business intelligence needs through data analytics.

In big data, the target the data is being loaded into might be a data warehouse such as BigQuery or Redshift, but it might also be a data lake or data lakehouse — where the transformed data is stored in object storage such as Amazon S3. For instance, data might be loaded into an S3 data lake in its raw form and then further processed, cleaned and transformed into a format that is more suitable for analytical querying, such as Apache Parquet or ORC.
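As a rough sketch of what that last step can look like in Presto SQL, the query below reads raw CSV-backed data and rewrites it as Parquet using CREATE TABLE AS. The catalog, schema, table, and column names (hive, raw, curated, orders) are illustrative, not part of any standard setup:

-- Materialize raw data as Parquet so later analytical queries scan less data.
-- Assumes a Hive catalog named "hive" with schemas "raw" and "curated".
CREATE TABLE hive.curated.orders
WITH (format = 'PARQUET')
AS
SELECT
    order_id,
    CAST(order_ts AS timestamp) AS order_ts,
    total_amount
FROM hive.raw.orders_csv;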

Using Presto for ETL

Presto is a distributed SQL query engine that is designed to be fast and responsive, allowing users to quickly analyze large datasets. It is an excellent choice for running interactive queries on large datasets, as it can execute queries in seconds or minutes, depending on the size of the dataset.

Even though Presto was designed to be an ad-hoc query engine, it can be a suitable choice if the ETL process is not too resource-intensive or complex. However, Presto provides no mid-query fault tolerance – if a long-running ETL query fails, it must be rerun from the start – and it lacks many features common in purpose-built ETL tools, such as caching.

It is worth noting that Presto is not a replacement for dedicated ETL and orchestration tools such as Apache Airflow, which are designed specifically for complex ETL workflows. However, if your ETL process is relatively simple and you are already using Presto for data analysis, it may make sense to use Presto for ETL as well.

If you are using Presto for ETL, it is recommended to spin up a separate Presto cluster to avoid resource contention with interactive Presto workloads. You should also break complex queries into a series of shorter ones and create temporary tables for manual checkpointing, as in the sketch below.
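Here is a minimal sketch of that checkpointing pattern, assuming a Hive catalog named hive; all schema, table, and column names are illustrative, and the target reporting table is assumed to already exist:

-- Illustrative names throughout; hive.reporting.user_activity is assumed to exist.
-- Step 1: checkpoint the expensive part of the pipeline in its own table,
-- so a failure in a later step does not force a full re-run.
CREATE TABLE hive.etl_tmp.daily_clicks AS
SELECT
    user_id,
    CAST(event_ts AS date) AS day,
    count(*) AS clicks
FROM hive.raw.events
WHERE event_type = 'click'
GROUP BY user_id, CAST(event_ts AS date);

-- Step 2: a shorter follow-up query reads from the checkpoint, not the raw data.
INSERT INTO hive.reporting.user_activity
SELECT day, count(DISTINCT user_id) AS active_users, sum(clicks) AS total_clicks
FROM hive.etl_tmp.daily_clicks
GROUP BY day;

-- Step 3: drop the checkpoint once the pipeline has succeeded.
DROP TABLE hive.etl_tmp.daily_clicks;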

Do you need to ETL your data before reading it in Presto?

Presto allows for ad-hoc querying of data wherever it resides, without the need for ETL into a separate system. Using Presto connectors, analysts can access the datasets they are interested in, and in-place execution means results are returned quickly – even when querying data directly from cloud object storage. This makes it much easier for data teams to access and analyze data in real time, without having to wait for ETL processes to complete. It also helps reduce the cost of data storage, as it eliminates the need to duplicate data in a separate system.

For most Presto use cases, there is no need to pre-process data with ETL or ELT before querying it. As soon as the data is stored in the data lake, it can be accessed with Presto SQL. This is almost always the case with interactive analytics, but there could be exceptions when it comes to scheduled reporting or BI dashboards – in these situations, you might consider using aggregated or pre-summarized datasets to reduce compute costs.
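For instance, a scheduled job might pre-summarize raw data once, so that each dashboard refresh scans a small aggregate table instead of the full history. A minimal sketch, with illustrative catalog, schema, and table names:

-- Nightly rollup: dashboards query this small table rather than raw orders.
-- (All names here are illustrative.)
CREATE TABLE hive.reporting.daily_revenue AS
SELECT
    CAST(order_ts AS date) AS day,
    sum(total_amount) AS revenue
FROM hive.curated.orders
GROUP BY CAST(order_ts AS date);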

Comparing Presto to Data Warehouses

The ‘zero ETL’ approach that Presto enables is one of its core advantages over traditional data warehouse architectures. ETL/ELT can involve a lot of manual work, data duplication, and errors, which can make analytics slower and more complex.

PrestoDB’s ability to run analytical queries on diverse data sources and on raw, semi-structured data can significantly shorten time to insight. Presto eliminates the need to load data into a data warehouse, since data can be queried directly from its source and schema changes can be implemented in real time; this also avoids the costly transformations otherwise needed to apply a relational schema to file-based data.

If you’re interested in trying out Presto in the cloud, Ahana Cloud is a SaaS for Presto that makes it easy to deploy Presto in the cloud and run queries on your data. Check out a demo today.

Presto REST API

Presto’s HTTP API (also referred to as the Presto REST API) is the communication protocol between server and client. It is used to send query statements for execution on the server and to receive results back on the client. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for details and usage notes.

As an example, you can make a simple REST call to Presto to get a JSON dump of recently run queries using this syntax:

http://<prestoServerHost>:<port>/v1/query

The default port for Presto is 8080 for non-secure clusters.

You can optionally specify a query ID – in this example the query ID I’m interested in is 20200926_204458_00000_68x9u:

https://myHost:443/v1/query/20200926_204458_00000_68x9u

The default port for secure Presto clusters is 443. 

Here’s a worked example using curl, specifying a user ID and password. Stats for the simple test query – select now() – can be seen in the response:

$ curl https://myHost:443/v1/query -u james
Enter host password for user 'james': *******


[{"queryId":"20210119_192148_00000_r4adv","session":{"queryId":"20210119_192148_00000_r4adv","transactionId":"26387f6e-6f5b-41a8-bac1-2fc4fed51e04","clientTransactionSupport":true,"user":"james","principal":"james","source":"presto-cli","timeZoneKey":2072,"locale":"en_GB","remoteUserAddress":"xxx.xxx.xxx.xxx","userAgent":"StatementClientV1/0.245.1-9966d7d","clientTags":[],"resourceEstimates":{},"startTime":1611084108707,"systemProperties":{},"catalogProperties":{},"unprocessedCatalogProperties":{},"roles":{},"preparedStatements":{}},"resourceGroupId":["global"],"state":"FINISHED","memoryPool":"general","scheduled":true,"self":"http://xxx.xxx.xxx.xxx:8585/v1/query/20210119_192148_00000_r4adv","query":"select now()","queryStats":{"createTime":"2021-01-19T19:21:49.034Z","endTime":"2021-01-19T19:21:50.014Z","queuedTime":"21.28ms","elapsedTime":"980.19ms","executionTime":"916.70ms","totalDrivers":17,"queuedDrivers":0,"runningDrivers":0,"completedDrivers":17,"rawInputDataSize":"0B","rawInputPositions":0,"cumulativeUserMemory":0.0,"userMemoryReservation":"0B","totalMemoryReservation":"0B","peakUserMemoryReservation":"0B","peakTotalMemoryReservation":"0B","peakTaskTotalMemoryReservation":"0B","totalCpuTime":"25.00ms","totalScheduledTime":"78.00ms","fullyBlocked":true,"blockedReasons":[],"totalAllocation":"0B","progressPercentage":100.0},"queryType":"SELECT","warnings":[]}]

Presto (and PrestoDB) Architecture and Design

The basics

Presto’s architecture and design integrate seamlessly with an existing data ecosystem without requiring any changes to it: Presto acts as an additional, faster data access interface that turbocharges your existing stack. Presto separates compute from storage; its federated query engine lets teams run analytics for a variety of use cases across a wide range of data lakes and databases in place, with no need to move data. Because Presto doesn’t store the data itself, compute resources can be scaled up and down based on query demand.

Presto supports a decoupled architecture

Most cloud deployments leverage object storage, which is already separate from the compute layer, and auto-scale to optimize resource costs. This separation of compute and storage makes Presto very well suited to cloud environments, though it can be deployed both in the cloud and on premises.

Learn more about how Presto works with Amazon S3.

Presto nodes and how PrestoDB handles querying

Presto’s architecture is very similar to that of a classic database management system using cluster computing (MPP). It consists of one coordinator node working together with multiple worker nodes. Clients submit SQL queries, which are parsed, analyzed, and planned, with parallel tasks scheduled out to workers. Workers jointly process rows from the data sources and produce the results that are returned to the client. Presto does not write intermediate results to disk as its precursor Apache Hive did, which makes queries significantly faster. Presto itself is written in the Java programming language.
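You can see this planning step for yourself with EXPLAIN. The sketch below shows the distributed plan – the fragments (stages) the coordinator schedules across workers – for a simple aggregation, using the built-in TPC-H catalog so it should run on any cluster with the tpch connector enabled:

-- Show the distributed plan fragments for a simple aggregation.
EXPLAIN (TYPE DISTRIBUTED)
SELECT returnflag, count(*) AS line_items
FROM tpch.tiny.lineitem
GROUP BY returnflag;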

Presto data sources and connectors

A single Presto query can combine data from multiple sources. Presto uses connectors to access various data sources; some of the more popular ones include HDFS, Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Redis, Kudu, Amazon Redshift, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, and MongoDB. A connector is similar in concept to a driver for a database. Several connectors are built into Presto – those for Hive, JMX, TPC-H benchmark data, and a system connector that provides access to built-in system tables. Presto accesses other data sources using third-party connectors, and more and more are becoming available.
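To illustrate, here is a hypothetical federated query that joins order data in a data lake (via a Hive catalog) with customer data in MySQL in a single statement; all catalog, schema, table, and column names are illustrative:

-- One query, two data sources: the Hive connector reads from the lake,
-- the MySQL connector reads from the operational database.
-- (Catalog and table names here are illustrative.)
SELECT c.country, sum(o.total_amount) AS revenue
FROM hive.curated.orders AS o
JOIN mysql.crm.customers AS c
    ON o.customer_id = c.id
GROUP BY c.country
ORDER BY revenue DESC;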

Presto Data Sources

Presto was designed and written from the ground up to efficiently query data sources of all sizes, ranging from gigabytes to petabytes. It connects to a wide variety of data sources, from HDFS to traditional relational databases, as well as NoSQL sources such as Cassandra. Presto is particularly well equipped to run multiple concurrent interactive queries against a data source.

Presto is a great fit for companies with disparate data sources. For organizations that can’t consolidate all of their data into one centralized store, Presto’s data federation capabilities create a unified query layer that lets you blend data across different sources. And because you can leverage many data sources at once, Presto can handle very large volumes of data.

Popular combinations include Presto with AWS S3, Cassandra, Accumulo, and more. Here are some of the data sources Presto connects to:

Accumulo

Alluxio

Amazon Redshift

Amazon S3

Cassandra

Druid

Elasticsearch

HDFS

Kafka

Kudu

Microsoft SQL Server

MongoDB

Phoenix

Pinot

RDS PostgreSQL

RDS MySQL

Redis

Teradata

If you want to get up and running with Presto quickly, check out Ahana Cloud which is SaaS for Presto.

Presto Server Types

Coordinators and workers are the two types of Presto servers. Here’s a look at each server type in more detail.

A Presto coordinator parses statements, plans queries, and manages Presto worker nodes. The coordinator is also the node to which a client connects to submit statements for execution. Every Presto installation must have a coordinator alongside one or more workers. For development or testing purposes, a single instance of Presto can be configured to perform both roles.

The coordinator keeps track of the activity on each worker and coordinates the execution of a query. The coordinator creates a logical model of a query involving a series of stages, which are then translated into a series of connected tasks running on a cluster of Presto workers.

A Presto worker is a server whose basic job is to execute tasks assigned by the coordinator and to process data. When a worker starts up, it advertises itself to the discovery server, which makes it available to the coordinator. Workers fetch data via connectors and exchange intermediate data with each other; the coordinator gathers the results from the workers and returns the final result to the client.

Workers use an HTTP-based protocol to communicate with both the coordinator and other workers.

Is Presto Free?

Many folks wonder: is Presto free? It is – PrestoDB is a free, open source, federated, distributed SQL query engine used for ad hoc analytics. The PrestoDB AMI is 100% open source and available for use in production immediately, and it is the first and only completely open source, completely free edition of the software.

At Ahana, we have a team of PrestoDB experts who are actively contributing to the open source project and community. Our mission is to help you be successful with your PrestoDB deployments. 

You can sign up for Ahana Cloud for Presto, the first fully integrated, cloud-native managed service for Presto. It is the easiest Presto experience ever, requiring no installation, no AWS AMIs or CFTs, and no configuration. You can be running SQL queries on your AWS data within 30 minutes. You can sign up for free here.

Presto Catalogs

Presto has several important components that allow you to easily manage data: catalogs, schemas and tables, and connectors. Presto accesses data via connectors; each data source is configured as a catalog, and you can query as many catalogs as you want in a single query. Catalogs contain schemas and information about where data is located, and every Presto catalog is associated with a specific Presto connector. Keep in mind that more than one catalog can use the same connector to access different instances of the same data source.

Catalogs are defined in properties files stored in the Presto configuration directory. Schemas are how you organize your tables, and catalogs and schemas together define what can be queried. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog.
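In practice that looks like this (the LIMIT is just to keep the illustration cheap):

-- Query with a fully-qualified name: catalog.schema.table.
SELECT * FROM hive.test_data.test LIMIT 10;

-- Or set the default catalog and schema once, then use bare table names.
USE hive.test_data;
SELECT * FROM test LIMIT 10;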

When you run a SQL statement in Presto, you are running it against one or more catalogs. For example, you can configure a JMX catalog to provide access to JMX information via the JMX connector; another example is a Hive catalog connecting to a Hive data source.

You can have as many catalogs as you need. For example, if you have additional Hive clusters, you simply add an additional properties file to etc/catalog with a different name, making sure it ends in .properties. If you name the properties file sales.properties, Presto creates a catalog named sales using the configured connector.
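As a sketch, such a file might look like the following – connector.name and hive.metastore.uri are real PrestoDB Hive connector properties, while the metastore hostname is purely illustrative:

# etc/catalog/sales.properties – defines a catalog named "sales"
# backed by the Hive connector. The metastore URI below is illustrative.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://sales-metastore.example.com:9083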