What is an Open Data Lake in the Cloud?

Data Driven Insights diagram

Problems that necessitate a data lake

In today’s competitive landscape, companies are increasingly leveraging their data to make better decisions, provide value to their customers, and improve their operations. Data-driven insights can help business and product leaders home in on customer needs and find untapped opportunities. Analytics dashboards can also be presented to customers for added value. Traditionally, insights have been gleaned from fairly small amounts of enterprise data of the kind you’d expect: historical information about products, customers, and sales. But the modern business must deal with thousands of times more data, which spans more types of data and goes far beyond enterprise data. Examples include third-party data feeds, IoT sensor data, event data, and geospatial and other telemetry data.

The problem with having thousands of times the data is that databases, and data warehouses specifically, can be very expensive at that scale. Data warehouses are also optimized for relational data with a well-defined structure and schema, which much of this newer data does not have. As both data volumes and usage grow, the costs of a data warehouse can easily spiral out of control. Those costs, coupled with the inherent lock-in associated with data warehouses, have left many companies looking for a better solution, either augmenting their enterprise data warehouse or moving away from it altogether.

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting this architecture because of its better price-performance, its scale, and its non-proprietary design.

The Open Data Lake in the cloud centers on cloud object storage such as Amazon S3. In AWS, there can be many S3 buckets across an organization. In Google Cloud, the equivalent service is Google Cloud Storage (GCS), and in Microsoft Azure it is Azure Blob Storage. Like a data warehouse, the data lake can store the relational data that typically comes from business apps. But the data lake also stores non-relational data from the variety of sources mentioned above, so it can hold structured, semi-structured, and unstructured data.

With all this data stored in the data lake, companies can run different types of analytics directly on it, such as SQL queries, real-time analytics, and AI/machine learning. A metadata catalog, which records where each dataset lives and how it is structured, is what makes analytics on the non-relational data possible.
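To make the catalog's role more concrete, here is a minimal Python sketch of the kind of information a metadata catalog holds and how an engine might look it up before reading any files. Everything in it is an illustrative assumption: the names `TableEntry`, `CATALOG`, `resolve`, and `column_names`, the `example-lake` bucket, and the table itself are hypothetical, and production catalogs track far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """What a catalog needs to know to expose files as a table (illustrative)."""
    location: str      # object-store prefix that holds the data files
    file_format: str   # for example "PARQUET", "ORC", or "JSON"
    columns: tuple     # (column name, column type) pairs


# Hypothetical catalog: table name -> where the data lives and how it is shaped.
CATALOG = {
    "sensor_readings": TableEntry(
        location="s3://example-lake/iot/sensor_readings/",
        file_format="PARQUET",
        columns=(
            ("device_id", "varchar"),
            ("ts", "timestamp"),
            ("temp_c", "double"),
        ),
    ),
}


def resolve(table_name: str) -> TableEntry:
    """Look up a table by name, as an engine would before planning a query."""
    try:
        return CATALOG[table_name]
    except KeyError:
        raise LookupError(f"table not registered in catalog: {table_name}") from None


def column_names(table_name: str) -> list:
    """Return just the column names recorded for a table."""
    return [name for name, _type in resolve(table_name).columns]
```

The point of the sketch is that the files in object storage carry no table names of their own; the catalog entry is what lets a query refer to `sensor_readings` instead of a bucket path.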

Why Open for Data Lakes

As mentioned, companies have the flexibility to run different types of analytics, using different analytics engines and frameworks. Storing the data in open formats is the best practice for companies looking to avoid the lock-in of the traditional cloud data warehouse. The most common formats in a modern data infrastructure, such as Apache Parquet and ORC, are open. They are designed for fast analytics and are independent of any platform. Once data is in an open format like Parquet, the natural next step is to run open source engines like Presto on it. Ahana Cloud is a Presto managed service that makes it easy, secure, and cost-efficient to run SQL on the Open Data Lake.
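As a rough illustration of running SQL over open-format files, the Python sketch below builds the kind of statement that registers existing Parquet files as a table. It assumes Presto's Hive connector, which accepts `format` and `external_location` table properties; the helper `external_table_ddl`, the `hive.lake.sensor_readings` table name, and the `example-lake` bucket are hypothetical, and the sketch only produces the SQL text rather than connecting to a cluster.

```python
def external_table_ddl(table, columns, location, file_format="PARQUET"):
    """Build a CREATE TABLE statement for files that already sit in object storage.

    Illustrative only: identifiers and values are interpolated as-is, so this
    is not safe for untrusted input.
    """
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n  {column_sql}\n)\n"
        f"WITH (format = '{file_format}', external_location = '{location}')"
    )


# Hypothetical table over Parquet files under an example bucket prefix.
ddl = external_table_ddl(
    "hive.lake.sensor_readings",
    [("device_id", "varchar"), ("temp_c", "double")],
    "s3://example-lake/iot/sensor_readings/",
)
print(ddl)
```

Because the table definition only points at the files, the same Parquet data stays readable by any other engine that understands the format, which is the lock-in argument made above.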

If you want to learn more about why you should be thinking about building an Open Data Lake in the cloud, check out our free whitepaper on Unlocking the Business Value of the Data Lake – how open and flexible cloud services help provide value from data lakes.

Helpful Links

Best Practices for Resource Management in PrestoDB

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud