Lakehouse solutions are becoming increasingly popular as an augmentation of, or replacement for, expensive locked-in data warehouses. However, many organizations still struggle with the cost of these implementations. Let’s discuss how to reduce the cost of your lakehouse solution: what the biggest cost drivers are, how they can be mitigated, and how open source can help reduce expenses.
Leveraging an open data lakehouse offers clear advantages, from its separation of compute and storage to the absence of vendor lock-in. Plus, you gain the freedom to opt for whichever engine best suits your needs and cut costs along the way!
As separated storage has become cheaper and more widely available, compute engines have emerged as the major cost factor in data lakehouses. When building a data lakehouse, storage, the metadata catalog, and table/data management are not the components that drive significant costs. Compute, however, is: as the number of jobs and queries to execute keeps growing, so does the hardware and software needed to run them, increasing costs significantly.
Fortunately, the majority of distributed compute engines, like Presto, are available as open source software that is completely free to use; all you pay for are servers or cloud instances. And although all compute engines offer similar functionality, certain ones have a more optimized design thanks to their underlying technology. These are far more efficient than the others, resulting in significant cost savings from the smaller number of servers required.
The open source Presto engine is very efficient and is becoming more so as its compute workers adopt native C++ vectorization. Compared with systems that run on a Java virtual machine, native C++ code is drastically more efficient: C++ is compiled ahead of time to machine code and gives developers fine-grained control over memory allocation, while the Java Virtual Machine (JVM) carries runtime overhead and is susceptible to infamous garbage collection storms, an issue C++ simply does not have. A perfect example of this contrast is Apache Spark SQL, which runs on the JVM, versus Databricks’ recently introduced proprietary Photon engine, which is written in C++.
Running an AWS lakehouse with Presto can potentially reduce your compute cost by two-thirds. Let’s take a look at a sample comparison of running an AWS lakehouse with another engine vs. with Presto. Consider a 200TB lakehouse running 20 nodes of a conventional engine, using current AWS pricing (December 2022): 20 × r5.8xl instances = $40/hour
Running 24/7 for 30 days, the compute comes to roughly $29K per month.
200TB of S3 storage = $4K per month.
Setting aside the data transfer charges, you’ll be spending 88% of that bill on compute.
So if you have a compute engine that is three times more efficient, like Presto, you need only one-third of the compute nodes for the same workload:
7 × r5.8xl instances = $14/hour
Running 24/7 for 30 days, the compute comes to roughly $10K per month.
200TB of S3 storage = $4K per month.
This comparison leaves out data transfer and metadata fees, which are negligible in both cases.
So the comparison comes to $33K vs. $14K per month,
for total savings on the order of 60% (and about two-thirds on compute alone), as the sketch below illustrates.
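To make the arithmetic easy to check, here is a minimal Python sketch of the comparison above. The per-node hourly rate and the S3 rate are rough approximations of December 2022 AWS on-demand prices, assumed for illustration rather than quoted from AWS.

```python
# Back-of-the-envelope cost model for the comparison above.
# Prices are approximations of December 2022 AWS on-demand rates (assumptions).
R5_8XL_HOURLY = 2.016      # approx. $/hour per r5.8xlarge instance
S3_PER_TB_MONTH = 21.0     # approx. $/TB-month for S3 Standard
HOURS_PER_MONTH = 24 * 30

def monthly_cost(nodes, storage_tb):
    """Return (compute, storage) dollars per month for a cluster."""
    compute = nodes * R5_8XL_HOURLY * HOURS_PER_MONTH
    storage = storage_tb * S3_PER_TB_MONTH
    return compute, storage

baseline_compute, storage = monthly_cost(nodes=20, storage_tb=200)  # ~$29K + ~$4K
presto_compute, _ = monthly_cost(nodes=7, storage_tb=200)           # ~$10K + ~$4K

baseline_total = baseline_compute + storage   # ~$33K/month
presto_total = presto_compute + storage       # ~$14K/month

# Compute's share of the baseline bill: ~87-88%, as noted above.
print(f"Compute share of baseline bill: {baseline_compute / baseline_total:.0%}")
print(f"Baseline: ${baseline_total:,.0f}/month vs. Presto: ${presto_total:,.0f}/month")
# Total savings: ~57%, i.e. on the order of 60%.
print(f"Total savings: {1 - presto_total / baseline_total:.0%}")
```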
Next: Learn more about Data Warehouse vs Data Mesh vs Data Lakehouse