The term Data Lakehouse has become very popular over the last year or so, especially as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse architecture is, and how companies are using the Data Lakehouse in production today. Finally, we’ll share a bit on where Ahana Cloud for Presto fits into this architecture and how real companies are leveraging Ahana as the query engine for their Data Lakehouse.
What is a Data Lakehouse?
First, it’s best to explain a Data Warehouse and a Data Lake.
A data warehouse is one central place where you can store specific, structured data. Most of the time that’s relational data that comes from transactional systems, business apps, and operational databases. You can run fast analytics on the data warehouse with very good price/performance. Using a data warehouse typically means you’re locked into that warehouse’s proprietary formats: the trade-off for the speed and price/performance is that your data is ingested into and locked inside that warehouse, so you lose the flexibility of a more open solution.
On the other hand, a Data Lake is one central place where you can store any kind of data you want – structured, unstructured, etc. – at scale. Popular Data Lake storage services are Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. Data Lakes are widely popular because they are cheap and easy to use: you can store a practically unlimited amount of any kind of data at a very low cost. However, the data lake doesn’t provide built-in mechanisms for querying or analytics. You need a query engine and data catalog on top of the data lake to query your data and make use of it (that’s where Ahana Cloud comes in, but more on that later).
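To make the catalog's role concrete, here is a minimal sketch in Python. It assumes a toy in-memory catalog; the table name, bucket path, and the `TableEntry` and `resolve` names are all hypothetical and are not part of any product. A real catalog is a separate service, but the idea is the same: the lake only holds files, and the catalog tells the query engine which files make up a table and how to read them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """Catalog metadata for one table stored as files in the data lake."""
    location: str      # object-store prefix that holds the data files
    file_format: str   # e.g. "PARQUET" or "ORC"
    columns: tuple     # (name, type) pairs


# Hypothetical catalog: table name -> metadata.
# A real catalog is a service, not an in-memory dict.
CATALOG = {
    "sales.orders": TableEntry(
        location="s3://example-bucket/sales/orders/",
        file_format="PARQUET",
        columns=(("order_id", "bigint"), ("amount", "double")),
    ),
}


def resolve(table_name: str) -> TableEntry:
    """Return where a table's files live, or raise if it is not registered."""
    try:
        return CATALOG[table_name]
    except KeyError:
        raise LookupError(f"table {table_name!r} is not in the catalog") from None


entry = resolve("sales.orders")
print(entry.location, entry.file_format)
```

Without an entry like this, a query engine has no way to know that a folder of files should be treated as a table with these columns.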
Now let’s look at the Data Lake vs. the Lakehouse. The data lakehouse is a newer architecture that takes the best of the Data Warehouse and the Data Lake. That means it’s open and flexible, has good price/performance, and scales like the Data Lake, while also supporting transactions and strong security like the Data Warehouse.
Data Lakehouse Architecture Explained
Here’s an example of a Data Lakehouse architecture:
You’ll see the key components include your Cloud Data Lake, your catalog & governance layer, and the data processing (SQL query engine). On top of that you can run your BI, ML, Reporting, and Data Science tools.
There are a few key characteristics of the Data Lakehouse. First, it’s based on open data formats – think ORC, Parquet, etc. That means you’re not locked into a proprietary format and can use an open source query engine to analyze your data. Your lakehouse data can be easily queried with SQL engines.
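As a rough illustration of what querying open formats with SQL can look like, the sketch below builds a Presto `CREATE TABLE` statement that points at Parquet files already sitting in the lake. It assumes Presto's Hive connector and its `format` and `external_location` table properties; the catalog, schema, table, column, and bucket names are made up, and the helper `external_table_ddl` is hypothetical. It does no input escaping, so treat it as a sketch rather than production code.

```python
def external_table_ddl(table, columns, location, file_format="PARQUET"):
    """Build a Presto Hive-connector CREATE TABLE statement for existing files.

    table       -- fully qualified table name (hypothetical in this example)
    columns     -- list of (name, sql_type) pairs
    location    -- object-store prefix holding the files (hypothetical)
    file_format -- open file format of those files, e.g. "PARQUET" or "ORC"
    """
    column_sql = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n"
        f"{column_sql}\n"
        f")\n"
        f"WITH (\n"
        f"  format = '{file_format}',\n"
        f"  external_location = '{location}'\n"
        f")"
    )


ddl = external_table_ddl(
    table="hive.sales.orders",
    columns=[("order_id", "bigint"), ("amount", "double")],
    location="s3://example-bucket/sales/orders/",
)
print(ddl)
```

Once a table like this is registered, the files stay where they are in the open format, and any engine that understands that format can read them.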
Second, a governance/security layer on top of the data lake is important to provide fine-grained access control to data. Last, performance is critical in the Data Lakehouse. To compete with data warehouse workloads, the data lakehouse needs a high-performing SQL query engine on top. That’s where open source Presto comes in: it’s a distributed SQL query engine built for fast, interactive queries, and it can give you similar, if not better, price/performance for your queries.
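To show what "fine-grained" means in practice, here is a toy column-level check in Python. The roles, table, columns, and the `authorize` function are all hypothetical; this is not how Presto, Ahana, or any specific governance product implements access control, only a sketch of the kind of rule such a layer enforces before a query reaches the data.

```python
# Hypothetical policy: role -> table -> set of columns that role may read.
POLICY = {
    "analyst": {"sales.orders": {"order_id", "amount"}},
    "admin": {"sales.orders": {"order_id", "amount", "customer_email"}},
}


def authorize(role, table, requested_columns):
    """Return the requested columns if the role may read all of them.

    Raises PermissionError naming the denied columns otherwise.
    Unknown roles and tables are denied by default.
    """
    allowed = POLICY.get(role, {}).get(table, set())
    denied = sorted(set(requested_columns) - allowed)
    if denied:
        raise PermissionError(f"{role} may not read {denied} from {table}")
    return list(requested_columns)


print(authorize("analyst", "sales.orders", ["order_id", "amount"]))
```

In this sketch an analyst can read order amounts but would be refused `customer_email`, while an admin can read all three columns of the same table.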
Building your Data Lakehouse with Ahana Cloud for Presto
At the heart of the Data Lakehouse is your high-performance SQL query engine. That’s what enables you to get high-performance analytics on your data lake data. Ahana Cloud for Presto is SaaS for Presto on AWS, an easy way to get up and running with Presto in the cloud (it takes under an hour). This is what your Data Lakehouse architecture would look like if you were using Ahana Cloud:
Ahana comes with a built-in data catalog and caching for your S3-based data lake. With Ahana you get the capabilities of Presto without the management overhead, because Ahana takes care of it for you under the hood. The stack also integrates with transaction managers like Apache Hudi and Delta Lake, and with AWS Lake Formation.
We shared more on how to unlock your data lake with Ahana Cloud in the data lakehouse stack in a free on-demand webinar.
Ready to start building your Data Lakehouse? Try it out with Ahana. We have a 14-day free trial (no credit card required), and in under 1 hour you’ll have SQL running on your S3 data lake.