Data Lake IO Caching for Ahana Cloud

One-click caching built into every Presto cluster for up to 5x query performance improvements

Data lake IO caching, built on the RubiX open source project, eliminates repeated reads from data lakes like AWS S3 when the same data is queried over and over.

Ahana Cloud users can enable data lake caching with the click of a button when creating a new Presto cluster. The rest, including attaching SSDs and sizing the cache based on the user-selected instance type, is done automatically.

Ready to speed up your Presto queries?
Sign up for a free trial of Ahana Cloud today 😀

Benefits

85% latency reduction for concurrent workloads

5x query performance improvement

Columnar caching for faster warmup times

RubiX is a filesystem cache that works with files and byte ranges. The advantage is that only the byte ranges that are actually required are read and cached. This works well with columnar formats like ORC and Parquet, whose readers request only certain columns stored in specific byte ranges.
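To illustrate the idea, here is a minimal sketch of block-aligned byte-range caching. The class, block size, and `fetch` callback are all hypothetical, not the RubiX API; the point is that a reader asking for one column's byte range only pulls and caches the blocks that range spans.

```python
# Hypothetical sketch of byte-range (block) caching; not the RubiX API.
BLOCK_SIZE = 4  # tiny block size for illustration; a real cache uses MB-sized blocks

class ByteRangeCache:
    def __init__(self, fetch):
        self.fetch = fetch          # fetch(path, block_index) -> bytes from remote store
        self.cache = {}             # (path, block_index) -> cached bytes
        self.remote_reads = 0       # how many blocks we had to pull remotely

    def read(self, path, offset, length):
        """Read a byte range, fetching only the uncached blocks it spans."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = bytearray()
        for b in range(first, last + 1):
            key = (path, b)
            if key not in self.cache:
                self.cache[key] = self.fetch(path, b)
                self.remote_reads += 1
            data += self.cache[key]
        start = offset - first * BLOCK_SIZE
        return bytes(data[start:start + length])
```

A columnar reader that repeatedly scans the same column hits the cached blocks on every read after the first, while untouched columns never occupy cache space.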

Engine Independent Scheduling Logic

RubiX decides which node will handle a particular split of a file and always returns the same node for that split, allowing the scheduler to use locality-based scheduling. RubiX uses consistent hashing to determine where each block should reside. Consistent hashing keeps the cost of rebalancing the cache low when nodes join or leave the cluster, e.g. during auto-scaling.
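A small sketch of the consistent-hashing technique, under stated assumptions: the ring, virtual-node count, and split naming below are illustrative, not RubiX internals. Each split hashes to a point on a ring and is assigned to the next node clockwise, so the same split always lands on the same node, and removing a node only remaps the splits that were on it.

```python
# Hypothetical consistent-hashing sketch for split-to-node assignment;
# illustrates the technique only, not RubiX's actual implementation.
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual points so load spreads evenly.
        self.ring = sorted(
            (_hash(f"{node}#{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def node_for(self, split):
        """Same split always maps to the same node (next point clockwise)."""
        i = bisect.bisect(self.keys, _hash(split)) % len(self.ring)
        return self.ring[i][1]
```

If a node leaves (say, during auto-scale-down), only the splits that hashed to that node move to a new owner; every other node keeps its cached blocks warm.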

Get analytics on your data faster

You no longer have to read data from the data lake itself; data is cached on each Presto worker node in the Ahana in-VPC compute plane.

Shared Cache across JVMs

RubiX runs a Cache Manager on every node that manages all the cached blocks. Every job, whether it runs in the same JVM (Presto) or in separate JVMs (Spark), consults the Cache Manager to read blocks.
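To make the shared-manager idea concrete, here is a minimal sketch. The class and method names are hypothetical; in this toy version a thread-safe in-process map stands in for the per-node Cache Manager daemon that multiple jobs consult, so a block fetched by one job is reused by every other job on the node.

```python
# Hypothetical per-node Cache Manager sketch; a shared in-process map
# stands in for the node-local daemon that real jobs would consult.
import threading

class CacheManager:
    """One per node: tracks cached blocks and serves lookups for all jobs."""
    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}           # (path, block_index) -> block bytes

    def get_or_fetch(self, path, block, fetch):
        """Return a cached block, fetching it from remote storage at most once."""
        with self._lock:
            key = (path, block)
            if key not in self._blocks:
                self._blocks[key] = fetch(path, block)
            return self._blocks[key]
```

Because both "jobs" go through the same manager, the remote store is only hit once per block regardless of which job asked first.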

👉 What’s RubiX?
RubiX is an open source, lightweight data caching framework that can be used by SQL-on-Hadoop engines and is designed to work with data stores like AWS S3.

Benchmarks

With Data Lake IO Caching, users see up to 85% latency reductions for concurrent workloads

Mixed Workload 6

  • Combination of 4 queries
  • Q1 × 12 times, Q2 × 10 times, Q3 × 7 times
  • 650 Billion Rows

TPC-H Benchmarking Results

Ready to get started?
Sign up for a free trial of Ahana Cloud today 😀