Flexible and Low-Latency OLAP Using Apache Pinot and Presto for Real-Time Analytics

Apache Pinot Overview

Apache Pinot is a distributed, low-latency online analytical processing (OLAP) platform for fast analytics on large data sets. Developed at LinkedIn in 2014, it is designed to power time-sensitive analytics with low latency and high throughput at scale. It was open-sourced in 2015 and entered the Apache Incubator in 2018. Its use cases include high-dimensional data analysis, business intelligence (BI), and user-facing metrics such as profile view counts. Other companies using Pinot include Uber, Microsoft, Target, Stripe, and Walmart.

Simplified View Of How Apache Pinot Works

Pinot is designed to be highly scalable and distributed while providing high throughput and fast response times. To achieve this, related data from streaming sources like Kafka and data lakes like S3 is stored in tables. Each table is split into segments, which are sets of immutable tuples. Segments are stored in a columnar format and also contain metadata, zone maps, and indices for the tuples they hold. Segments are stored and replicated across Pinot server nodes. Controller nodes hold global metadata for all segments in a cluster, such as the mapping of server nodes to segments.
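
To make the role of columnar storage and zone maps concrete, here is a minimal Python sketch. It is a toy model and not Pinot's actual segment format: the Segment class, its method names, and the count_in_range helper are all hypothetical. It assumes a zone map is simply the minimum and maximum value of a column within one segment, which lets a query skip segments that cannot contain matching rows.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Toy segment (hypothetical, not Pinot's on-disk format)."""
    name: str
    columns: dict  # column name -> tuple of values, i.e. a columnar layout

    def zone_map(self, column):
        # Assumption: a zone map is the (min, max) of a column in this segment.
        values = self.columns[column]
        return min(values), max(values)

    def may_contain(self, column, low, high):
        # The segment can be skipped when its value range misses [low, high].
        seg_min, seg_max = self.zone_map(column)
        return not (seg_max < low or seg_min > high)


def count_in_range(segments, column, low, high):
    """Count rows with low <= value <= high, scanning only candidate segments."""
    scanned = []
    total = 0
    for segment in segments:
        if not segment.may_contain(column, low, high):
            continue  # pruned using the zone map alone, no column scan needed
        scanned.append(segment.name)
        total += sum(1 for v in segment.columns[column] if low <= v <= high)
    return total, scanned


segments = [
    Segment("s0", {"ts": (1, 2, 3)}),
    Segment("s1", {"ts": (10, 11, 12)}),
]
```

Calling count_in_range(segments, "ts", 10, 11) returns a count of 2 and reports that only s1 was scanned, because the zone map for s0 rules that segment out before any of its values are read.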

Pinot consists of four main components: brokers, servers, minions, and controllers. The controller handles cluster management, scheduling, and resource allocation, and exposes a REST API for administration. The broker receives client queries, sends them to servers for execution, and returns the results to the client. Servers host the segments that store data and handle most of the distributed processing. They are divided into offline and real-time servers: offline servers typically host immutable segments, while real-time servers ingest data from streaming sources. Minions run maintenance tasks unrelated to query processing, such as periodically purging data from a Pinot cluster for security and regulatory compliance reasons.
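
The broker's role can be illustrated with a small scatter-gather sketch in Python. This is a simplified model under stated assumptions, not Pinot's implementation: the routing table, the sample data, and the functions server_execute and broker_query are hypothetical names invented for illustration. The sketch assumes the broker has access to a segment-to-server mapping of the kind the controller maintains.

```python
from collections import defaultdict

# Hypothetical segment-to-server mapping, standing in for controller metadata.
SEGMENT_TO_SERVER = {"seg_0": "server_a", "seg_1": "server_b", "seg_2": "server_a"}

# Hypothetical data: each segment holds one "country" column.
SEGMENT_DATA = {
    "seg_0": ["US", "DE", "US"],
    "seg_1": ["FR", "US"],
    "seg_2": ["DE", "DE"],
}


def server_execute(segment_names, country):
    """Server side: count matching rows across the segments this server hosts."""
    return sum(SEGMENT_DATA[name].count(country) for name in segment_names)


def broker_query(country):
    """Broker side: group segments by server, fan out, then merge partial counts."""
    plan = defaultdict(list)
    for segment, server in SEGMENT_TO_SERVER.items():
        plan[server].append(segment)
    # In a real cluster these calls would be network requests issued in parallel.
    partials = {server: server_execute(segs, country) for server, segs in plan.items()}
    return sum(partials.values()), partials
```

Here broker_query("US") returns a total of 3, made up of a partial count of 2 from server_a and 1 from server_b. The client only ever talks to the broker and never needs to know where segments live.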

What is Presto?

Presto is a fast, open source, distributed SQL query engine that processes queries in parallel across a cluster.

Presto's architecture consists of a coordinator node and multiple worker nodes. The coordinator accepts queries, plans and schedules them, and returns results to the client. The workers do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Fast and Flexible OLAP With Pinot and Presto

When carrying out analytics, system designers and developers normally have to make a tradeoff between query flexibility and fast response times: the more flexible a system is, the slower its responses tend to be. Pinot is extremely fast but has limited query flexibility, while Presto is somewhat slower but far more flexible. Using a Pinot cluster as the storage layer and a Presto cluster as the query layer gives users high-throughput, low-latency storage together with powerful, flexible querying. The integration relies on an open source Presto Pinot connector, which manages connections and maps queries and their results between the two platforms. Performance is optimized by pushing query work down to Pinot where possible, while Presto supplies features that Pinot lacks, such as table joins.
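
The division of labor between pushdown and joins can be sketched in Python as follows. This is a conceptual toy and not the connector's code: the table contents and the functions pinot_aggregate and presto_join are hypothetical. The catalog and table names in the SQL comment are likewise assumptions made for illustration.

```python
from collections import Counter

# A query of roughly this shape is what the sketch models (names are hypothetical):
#   SELECT m.name, COUNT(*) AS views
#   FROM pinot.default.page_views v
#   JOIN hive.default.members m ON v.member_id = m.member_id
#   WHERE v.country = 'US'
#   GROUP BY m.name

# Hypothetical fact table stored in Pinot.
PAGE_VIEWS = [
    {"member_id": 1, "country": "US"},
    {"member_id": 1, "country": "US"},
    {"member_id": 2, "country": "US"},
    {"member_id": 2, "country": "DE"},
    {"member_id": 3, "country": "US"},
]

# Hypothetical dimension table reachable through another Presto catalog.
MEMBERS = {1: "Ada", 2: "Grace"}


def pinot_aggregate(rows, country):
    """Pushed-down part: filter and group-by run next to the data."""
    return dict(Counter(r["member_id"] for r in rows if r["country"] == country))


def presto_join(view_counts, members):
    """Presto-side part: inner join of the aggregated rows with a dimension table."""
    return sorted((members[m], n) for m, n in view_counts.items() if m in members)


pushed_down = pinot_aggregate(PAGE_VIEWS, "US")
result = presto_join(pushed_down, MEMBERS)
```

In this toy run, only three aggregated rows cross from the storage layer to the query layer instead of five raw rows, and result is [("Ada", 2), ("Grace", 1)]. The same idea at production scale is why pushing filters and aggregations down to Pinot matters.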

You can learn more about the Apache Pinot connector for Presto in the PrestoCon session presented by the Apache Pinot team.

Get Started with Apache Pinot & Presto