Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Reporting and dashboarding diagram

The Open Data Lakehouse – a quick intro

Data warehouses have been considered a standard to perform analytics on structured data but cannot handle unstructured data such as text, images, audio, video and other formats. Additionally, machine learning and AI are becoming common in every aspect of business and they need access to vast amounts of data outside of data warehouses.

The cloud transformation has triggered the disaggregation of compute and storage which leverages cost benefits and enables adaptability to store data coming from multiple dimensions. All this has led to a new data platform architecture called the Open Data Lakehouse. It solves the challenges of the traditional cloud data warehouse through its use of open source and open format technologies such as Presto and Hudi. In this blog you will learn more about open data lake analytics stack using Presto, Hudi, and AWS S3. 

What is an Open Data Lakehouse

The Open Data Lakehouse is based on the concept of bringing your warehouse workloads to the data lake. You can run analytics on technology and tools that do not require any vendor lock-in including licensing, data formats, interfaces and infrastructure.

Four key elements include:

Open source – The technologies on the stack we will be exploring for Open Data lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community. 

Open formats – Also they don’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet and others.

Open interfaces – The interfaces are industry standard ANSI SQL compatible and standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, industry standard language clauses continue to be added in and expanded on. 

Open cloud – The stack is cloud agnostic and without storage natively aligns with containers and can be run on any cloud. 

Why Open Data Lakehouse

Open data lakehouses allow consolidation of structured and unstructured data in a central repository at cheaper cost and removes the complexity of running ETL, resulting in high performance and reducing cost and time to run analytics.

  • Bringing compute to your data (decouple of compute and storage)
  • Flexibility at the governance/transaction layer
  • Flexibility and low cost to store structured and semi/unstructured data
  • Flexibility at every layer – pick and choose which technology works best for your workloads/use case

Open Data Lakehouse architecture

Now let’s dive into the stack itself and each of the layers. We’ll discuss what problems each layer solves for.

The next EDW is the Open Data Lakehouse. Learn the data lakehouse format.

BI/Application tools – Data Visualization, Data Science tools

Plug in your BI/analytical application tool of choice. The Open Data Lake Analytics stack supports the use of JDBC/ODBC drivers so you can connect Tableau, Looker, preset, jupyter notebook, etc. based on your use case and workload. 

Presto – SQL Query Engine for the Data Lake

Presto is a parallel distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores. 

Presto is built for high performance interactive querying with in-memory execution 

Key characteristics include: 

  • High scalability from 1 to 1000s of workers
  • Flexibility to support a wide range of SQL use cases
  • Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
  • Federation of data sources particularly data lakes via Presto connectors
  • Seamless integration with existing SQL systems with ANSI SQL standard
deploying presto clusters

A full deployment of Presto has a coordinator and multiple workers. Queries are sub‐ mitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into the data storage tier like S3.

Apache Hudi – Streaming Transactions in the Open Data Lake

One of the big drawbacks in traditional data warehouses is keeping the data updated. It requires building data mart/cubes then doing constant ETL from source to destination mart, resulting in additional time, cost and duplication of data. Similarly, data in the data lake needs to be updated  and consistent without that operational overhead. 

A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves for the following: 

  • Ingesting incremental data
  • Changing data capture, both insert and deletion
  • Incremental data processing
  • ACID transactions

Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals, is an open-source based transaction layer with storage abstraction for analytics developed by Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses open file formats Parquet and Avro for data storage and internal table formats known as Copy-On-Write and Merge-On-Read.

It has built-in integration with Presto so you can query “hudi datasets” stored on the open file formats.

Hudi Data Management

Hudi has a table format which is based on directory structure and the table will have partitions, which are folders containing data files for that partition. It has indexing capabilities to support fast upserts. Hudi has two table types defining how data is indexed and layed out which defines how the underlying data is exposed to queries.

Hudi data management

(Image source: Apache Hudi)

  • Copy-On-Write (COW): Data is stored in Parquet file format (columnar storage), and each new update creates a new version of files during a write. Updating an existing set of rows will result in a rewrite of the entire parquet files for the rows being updated.
  • Merge-On-Read (MOR): Data is stored in a combination of Parquet file format (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which will produce new versions of the columnar files.

Based on the two table types Hudi provides three logical views for querying data from the Data Lake.

  • Read-optimized – Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
  • Incremental – Queries see new data written to the table after a commit/compaction. This helps to build incremental data pipelines and it’s analytics.
  • Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline

AWS S3 – The Data Lake

The data lake is the central location for storing data from disparate sources such as structured, semi-structured and unstructured data and in open formats on object storage such as AWS S3.

Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.

Getting Started:

How to run Open data lake analytics workloads using Presto to query Apache Hudi datasets on S3

Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can actually use Presto to query your Hudi datasets on S3.

Ingest your data into AWS S3 and query with Presto

Data can be ingested on Data lake from different sources such as kafka and other databases, by introducing hudi into the data pipeline the needed Hudi tables will be created/updated and the data will be stored in either Parquet or Avro format based on the table type in S3 Data Lake. Later BI Tools/Application can query data using Presto which will reflect updated results as data gets updated.

Conclusion:

The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance and cost.

The technologies that make up that stack are critical. Presto, being the de-facto SQL query engine for the data lake, along with the transactional support and change data capture capabilities of Hudi, make it a strong open source and open format solution for data lake analytics but a missing component is Data Lake Governance which allows to run queries on S3 more securely. AWS has recently introduced Lake formation, a data governance solution for data lake and Ahana, a managed service for Presto seamlessly integrates Presto with AWS lake formation to run interactive queries on your AWS S3 data lakes with fine grained access to data.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.