Configuring RaptorX – multi-level caching with Presto


RaptorX Background and Context

Meta introduced a multi-level cache at PrestoCon 2021. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. Here at Ahana, engineers have also been working on RaptorX to make it usable for the community by fixing a few open issues and by tuning and testing it heavily with other workloads. This is a unique and very powerful feature available only in PrestoDB, not in other versions or forks of the Presto project.

Presto is a disaggregated compute-storage query engine, which lets customers and cloud providers scale compute and storage independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, because scanning huge amounts of data between the storage tier and the compute tier is IO-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to avoid performing any I/O at all, which makes memory utilization and caching structures of utmost importance.

Let’s walk through the normal workflow of the Presto Hive connector:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition information).
  2. The scheduler sends requests to remote storage to get the list of files and schedules the splits.
  3. A worker node receives its list of files from the scheduler and sends requests to remote storage to open each file and read the file footers.
  4. Based on the footer, Presto determines which data blocks (chunks) need to be read from remote storage.
  5. Once the workers have read the data, Presto performs the computation (joins or aggregations) on the leaf worker nodes and shuffles the results back to return them to the client.

That is a lot of RPC calls: not just to the Hive Metastore to get partition information, but also to remote storage to list files, schedule them, open them, and then retrieve and read the data. Each of these IO paths in the Hive connector can become a bottleneck for query performance, which is why RaptorX builds a multi-layer cache intelligently, so that you can maximize the cache hit rate and boost query performance.
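
To see how these steps map onto an actual query, you can look at the distributed plan for a simple scan. A minimal example, assuming the ahana_hive catalog and the customer_orc table used later in this article: the leaf SOURCE fragment contains the Hive TableScan and partial aggregation (steps 2 to 5), and the fragment above it receives the shuffled partial results and produces the final answer.

-- Show the distributed plan: one leaf fragment scans the Hive table on the
-- workers, and an upstream fragment aggregates the shuffled partial results
EXPLAIN (TYPE DISTRIBUTED)
SELECT count(*) FROM ahana_hive.default.customer_orc;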

RaptorX introduces a total of five types of caches plus an affinity scheduler. This cache system is only applicable to the Hive connector.

Multi-layer cache        | Type       | Affinity scheduling | Benefits
Data IO                  | Local disk | Required            | Reduced query latency
Intermediate result set  | Local disk | Required            | Reduced query latency and CPU utilization for aggregation queries
File metadata            | In-memory  | Required            | Reduced CPU utilization and query latency
Metastore                | In-memory  | NA                  | Reduced query latency
File list                | In-memory  | NA                  | Reduced query latency

Table: Summary of the Presto multi-layer cache implementation

The rest of this article explains how you can configure and test each layer of the RaptorX cache in your Presto cluster.

#1 Data (IO) cache

This cache uses a library built on the Alluxio LocalCacheFileSystem, which is an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on the worker nodes that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB, evictions use an LRU policy, and local disks are required to enable this cache.

To enable this cache, update the worker configuration with the properties below in

etc/catalog/<catalog-name>.properties 

cache.enabled=true 
cache.type=ALLUXIO 
# Set the cache size based on your requirements
cache.alluxio.max-cache-size=150GB 
cache.base-directory=file:///mnt/disk1/cache

Also add the Alluxio property below to the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:
-Dalluxio.user.app.id=presto
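
Once the coordinator and workers are restarted with these settings, a quick sanity check (the same query is used again in the testing section below) is to confirm that the Alluxio cache metrics tables appear in the jmx catalog:

SHOW TABLES FROM jmx.current like '%alluxio%';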

#2 Fragment result set cache

This is an intermediate result set cache: it lets you cache partially computed result sets on the worker’s local SSD. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.

Add the following properties to etc/config.properties:

fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h

#3 Metastore cache

The Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. This cache is versioned to confirm the validity of cached metadata.

To enable the metastore cache, set the properties below in etc/catalog/<catalog-name>.properties:

hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000
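
To see the effect, run the same partition-pruned query twice; the second run should plan faster because the partition information is served from the coordinator’s cache instead of the Hive Metastore. A minimal sketch, using a hypothetical partitioned table orders_part with partition column ds:

-- First run: partition information is fetched from the Hive Metastore
SELECT count(*) FROM ahana_hive.default.orders_part WHERE ds = '2021-09-01';
-- Second run: the same partition information is served from the metastore cache
SELECT count(*) FROM ahana_hive.default.orders_part WHERE ds = '2021-09-01';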

#4 File List cache

The Presto coordinator caches file lists from the remote storage partition directory to avoid long listFile calls to remote storage. This is a coordinator-only, in-memory cache.

Enable the file list cache by setting the properties below in

etc/catalog/<catalog-name>.properties 

# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*
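
Setting hive.file-status-cache-tables=* caches listings for every table. If you prefer to cache listings only for specific hot tables, you can list them instead of using the wildcard (the table names below are only illustrative):

# Cache file listings only for selected tables
hive.file-status-cache-tables=default.customer_orc,default.customer_parquet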

#5 File metadata cache

This cache keeps open file descriptors and stripe/file footer information in worker memory. These pieces of data are accessed most frequently when reading files. The cache not only decreases query latency but also reduces CPU utilization.

This is an in-memory cache and is available for the ORC and Parquet file formats.

For ORC, it caches the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches the file and block level metadata.

To enable the file metadata cache, set the properties below in etc/catalog/<catalog-name>.properties:

# For ORC metadata cache: 
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h

The <catalog-name> in the configuration above should be replaced with the name of the catalog you are setting these properties in. For example, if the catalog properties file is named ahana_hive.properties, then it should be replaced with “ahana_hive”.

#6 Affinity scheduler

With affinity scheduling, the Presto coordinator schedules requests that process a given file or piece of data to the same Presto worker node in order to maximize cache hits. Consistently sending requests for the same data to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling. Hence, this property must be enabled in order for the RaptorX Data IO, fragment result, and file metadata caches to work.

To enable the affinity scheduler, set the property below in etc/catalog/<catalog-name>.properties:

hive.node-selection-strategy=SOFT_AFFINITY
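
For reference, a Hive catalog file with all of the catalog-level cache settings from this article combined might look like the sketch below. It reuses the example values above and assumes the catalog name ahana_hive; note that the fragment result cache from section #2 goes into etc/config.properties and the Alluxio metrics flag into etc/jvm.config, not into this file.

# etc/catalog/ahana_hive.properties (sketch; merge with your existing connector settings)

# Affinity scheduling (required for the Data IO, fragment result, and file metadata caches)
hive.node-selection-strategy=SOFT_AFFINITY

# Data IO cache (worker local disk)
cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache

# Metastore cache (coordinator memory)
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

# File list cache (coordinator memory)
hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*

# ORC file metadata cache (worker memory)
ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.orc.file-tail-cache-size=100MB
ahana_hive.orc.file-tail-cache-ttl-since-last-access=6h
ahana_hive.orc.stripe-metadata-cache-enabled=true
ahana_hive.orc.stripe-footer-cache-size=100MB
ahana_hive.orc.stripe-footer-cache-ttl-since-last-access=6h
ahana_hive.orc.stripe-stream-cache-size=300MB
ahana_hive.orc.stripe-stream-cache-ttl-since-last-access=6h

# Parquet file metadata cache (worker memory)
ahana_hive.parquet.metadata-cache-enabled=true
ahana_hive.parquet.metadata-cache-size=100MB
ahana_hive.parquet.metadata-cache-ttl-since-last-access=6h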

How can you test or debug your RaptorX cache setup with JMX metrics?

Each section below lists queries to run and the JMX metrics tables to query in order to verify cache usage.

Note: if your catalog is not named ‘ahana_hive’, you will need to change the table names used below to verify cache usage. Substitute ahana_hive with your catalog name.
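
If you are not sure of the exact JMX table name for your catalog, you can first list the cache statistics tables exposed by the jmx catalog (the LIKE pattern below is just one way to narrow the list):

-- Per-catalog cache statistics tables (ORC/Parquet metadata caches, etc.)
SHOW TABLES FROM jmx.current like '%cachestats%';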

Data IO Cache

Queries to trigger Data IO cache usage

USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;

Queries to verify Data IO cache usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages (of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';

Fragment Result Cache

An example of a query plan fragment whose results are eligible for caching is shown below.

Fragment 1 [SOURCE] 
    Output layout: [count_3] 
    Output partitioning: SINGLE [] 
    Stage Execution Strategy: UNGROUPED_EXECUTION 
    - Aggregate(PARTIAL) => [count_3:bigint] 
            count_3 := "presto.default.count"(*) 
        - TableScan[TableHandle {connectorId='hive', 
              connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, analyzePartitionValues=Optional.empty}', 
              layout='Optional[default.customer_orc{}]'}, gr 
              Estimates: {rows: 150000 (0B), cpu: 0.00, memory: 0.00, network: 0.00} 
              LAYOUT: default.customer_orc{}

Queries to trigger fragment result cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query the fragment result set cache JMX metrics.

-- All fragment result set cache metrics, such as cache hits, cache entries, size, etc. 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";

ORC metadata cache

Queries to trigger ORC metadata cache usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query ORC Metadata cache JMX metrics

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";

Parquet metadata cache

Queries to trigger Parquet metadata cache

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;

Query Parquet Metadata cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";

File List cache
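
Queries to trigger File List cache usage (any query that lists the table’s files should populate this cache on the coordinator; the counts below reuse the customer_orc table from the earlier sections)

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;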

Query File List cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";

In addition to this, we have enabled these multi-layer caches for Presto in Ahana Cloud by adding S3 support as the external filesystem for the Data IO cache, along with more optimized scheduling and tooling to visualize cache usage.

Figure: Multi-level Data Lake Caching with RaptorX

Ahana-managed Presto clusters can take advantage of the RaptorX cache, and at Ahana we have simplified all of these steps so that data platform users can enable Data Lake caching seamlessly with just one click. Ahana Cloud for Presto enables you to get up and running with an Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management, and more. Check out our on-demand webinar, where we share how you can build an Open Data Lake Analytics stack.