Presto and ETL – Clarifying Common Misconceptions

Data integration and ETL are typically seen as crucial to gaining insights from data. But when it comes to Presto, things get trickier: Should you use Presto for ETL? Or should you run ETL flows before querying data with Presto?

In this article, we will explore using Presto for ETL, how it compares to traditional data warehouse architectures, and whether ETL is necessary before querying data with Presto.

Looking for a better solution to run Presto in public clouds? Try Ahana for superior price-performance.

What is ETL in big data?

ETL (Extract, Transform, Load) is an iterative data integration process used in data warehousing to bring data from multiple sources into a single, centralized data repository. The process involves extracting data from its original source, cleaning and validating it, transforming it into a suitable format, and loading it into the data warehouse.

This process is necessary because data from different sources may have different formats and structures, and so it needs to be unified and organized before it can be used in data analysis and decision-making. ETL also allows data to be refreshed on a regular schedule, so that the data warehouse stays up-to-date. Additionally, ETL tools allow data to be migrated between a variety of sources, destinations, and analysis tools, and enable companies to address specific business intelligence needs through data analytics.

In big data, the target the data is being loaded into might be a data warehouse such as BigQuery or Redshift, but it might also be a data lake or data lakehouse — where the transformed data is stored in object storage such as Amazon S3. For instance, data might be loaded into an S3 data lake in its raw form and then further processed, cleaned and transformed into a format that is more suitable for analytical querying, such as Apache Parquet or ORC.
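As a concrete sketch of that last step, Presto itself can perform the raw-to-Parquet conversion with a `CREATE TABLE AS` statement through the Hive connector. The catalog, schema, table, and bucket names below are illustrative, not from the article:

```sql
-- Hypothetical example: rewrite raw JSON data registered in the Hive
-- metastore as a partitioned Parquet table on S3. All names are
-- illustrative placeholders.
CREATE TABLE hive.analytics.events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake/curated/events/',
    partitioned_by = ARRAY['event_date']
)
AS
SELECT
    user_id,
    event_type,
    CAST(event_time AS timestamp) AS event_time,
    -- Partition columns must come last in the SELECT list.
    date(CAST(event_time AS timestamp)) AS event_date
FROM hive.raw.events_json;
```

Queries against the curated Parquet table will scan far less data than queries against the raw JSON, since Parquet is columnar and the partitioning lets Presto prune by date.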

Using Presto for ETL

Presto is a distributed SQL query engine that is designed to be fast and responsive, allowing users to quickly analyze large datasets. It is an excellent choice for running interactive queries on large datasets, as it can execute queries in seconds or minutes, depending on the size of the dataset.

Even though Presto was designed to be an ad-hoc query engine, it can be a suitable choice if the ETL process does not require too many resources or is not too complex. However, Presto offers no mid-query fault tolerance — if a long-running query fails partway through, it must be restarted from the beginning — and it lacks many features common in purpose-built ETL tools, such as scheduling, retries, and checkpointing.

It is worth noting that Presto is not a replacement for workflow orchestrators like Apache Airflow, which are designed specifically to manage complex ETL pipelines. However, if your ETL process is relatively simple and you are already using Presto for data analysis, it may make sense to use Presto for ETL as well.

If you are using Presto for ETL, it is recommended to spin up a separate Presto cluster to avoid resource contention with interactive Presto workloads. You should also break complex queries into a series of shorter ones and create intermediate tables as manual checkpoints, so a failure late in the pipeline does not force you to recompute everything.
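A minimal sketch of that manual-checkpointing pattern, with illustrative catalog, schema, and table names (not from the article):

```sql
-- Step 1: materialize the expensive join as a checkpoint table.
CREATE TABLE hive.staging.orders_enriched AS
SELECT o.order_id, o.order_total, c.region
FROM hive.sales.orders o
JOIN hive.crm.customers c ON o.customer_id = c.customer_id;

-- Step 2: aggregate from the checkpoint. If this step fails,
-- only this shorter query needs to be rerun, not the join above.
CREATE TABLE hive.reporting.revenue_by_region AS
SELECT region, sum(order_total) AS revenue
FROM hive.staging.orders_enriched
GROUP BY region;

-- Step 3: drop the checkpoint once the pipeline completes.
DROP TABLE hive.staging.orders_enriched;
```

This trades extra storage and write time for the ability to resume a failed pipeline from the last completed step — a feature dedicated ETL tools provide automatically but Presto does not.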

Do you need to ETL your data before reading it in Presto?

Presto allows for ad-hoc querying of data wherever it resides, without the need for ETL into a separate system. Using Presto connectors, analysts can access the datasets they are interested in, and in-place execution means results come back quickly — even when querying data directly from cloud object storage. This makes it much easier for data teams to access and analyze data in near real-time, without having to wait for ETL processes to complete. It also helps reduce data storage costs, as it eliminates the need to duplicate data in a separate system.
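For example, a single Presto query can join tables living in entirely different systems by referencing each connector's catalog. The catalogs, schemas, and tables below are hypothetical, assuming a Hive connector over S3 and a PostgreSQL connector are both configured:

```sql
-- Hypothetical federated query: join clickstream data in an S3 data
-- lake (Hive connector) with a customer table in PostgreSQL, with no
-- ETL into a common system. All names are illustrative.
SELECT c.country, count(*) AS page_views
FROM hive.weblogs.page_views pv
JOIN postgresql.public.customers c
  ON pv.customer_id = c.id
GROUP BY c.country
ORDER BY page_views DESC;
```

Neither dataset has to be copied or transformed in advance; Presto pushes work down to each source where it can and joins the results itself.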

For most Presto use cases, there is no need to pre-process data with ETL or ELT before querying it. As soon as the data is stored in the data lake, it can be accessed with Presto SQL. This is almost always the case with interactive analytics, but there could be exceptions when it comes to scheduled reporting or BI dashboards – in these situations, you might consider using aggregated or pre-summarized datasets to reduce compute costs.

Comparing Presto to Data Warehouses

The ‘zero ETL’ approach that Presto enables is one of its core advantages over traditional data warehouse architectures. ETL/ELT can involve a lot of manual work, data duplication, and errors, which can make analytics slower and more complex.

PrestoDB’s ability to run analytical queries on diverse data sources and on raw semi-structured data can significantly shorten the path from data to insight. Presto eliminates the need to load data into a data warehouse, since data can be queried directly from its source, and because Presto applies schema on read, schema changes can take effect without reloading data. This also avoids the costly transformations otherwise required to fit file-based data into a relational schema up front.

If you’re interested in trying out Presto in the cloud, Ahana Cloud is a SaaS for Presto that makes it easy to deploy Presto in the cloud and run queries on your data. Check out a demo today.