Streaming Data Processing Using Apache Kafka and Presto

Kafka Quick Start

Kafka is a distributed data streaming platform for building highly scalable distributed systems. Developed at LinkedIn in 2008 and open-sourced in 2011, it was designed to connect systems that are decoupled yet conceptually related. At its simplest, Kafka provides a consistent, fast, and highly scalable log. Specifically, it is a commit log: all writes are guaranteed to be ordered, and entries cannot be deleted or modified.

Once entries are added to the log, different systems can process them, communicating with each other as needed, often by appending further entries to the Kafka commit log. This enables software to be built as a system of systems. Communication and processing happen in parallel and asynchronously, so each system can be developed, maintained, scaled, and enhanced independently. Companies using Kafka include Coursera, Netflix, Spotify, Activision, Uber, and Slack.

Inner Workings Of Kafka

Kafka consists of producers that send messages to a Kafka broker. Messages are grouped into topics to which consumers subscribe. Each consumer receives all the messages sent to the topics it subscribes to and carries out further processing as required. Messages sent to a broker are retained for a configured time or until the topic reaches a configured size on disk. Kafka is deployed as a cluster of several brokers so that there is no single point of failure.
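The commit-log model described above can be sketched with a toy in-memory broker: producers append to an ordered, append-only log per topic, and each consumer tracks its own read offset. This is illustrative only and does not use the real Kafka API; all names here are invented for the example.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory sketch of Kafka's model: an append-only log per topic,
    with each consumer tracking its own read offset."""

    def __init__(self):
        self.logs = defaultdict(list)      # topic -> append-only list of messages
        self.offsets = defaultdict(int)    # (consumer, topic) -> next offset to read

    def produce(self, topic, message):
        self.logs[topic].append(message)   # writes are ordered, entries never change

    def consume(self, consumer, topic):
        """Return all messages this consumer has not yet seen on the topic."""
        start = self.offsets[(consumer, topic)]
        messages = self.logs[topic][start:]
        self.offsets[(consumer, topic)] = len(self.logs[topic])
        return messages

broker = MiniBroker()
broker.produce("clicks", {"user": "a", "page": "/home"})
broker.produce("clicks", {"user": "b", "page": "/docs"})

print(broker.consume("analytics", "clicks"))  # both messages, in order
print(broker.consume("analytics", "clicks"))  # [] -- already read
```

Note that consuming does not remove entries: a second consumer subscribing later still receives the full ordered log, which is what lets independent systems process the same stream.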

Messages sent to topics are split into partitions that are replicated across several nodes. The replication factor is determined by the performance and resilience requirements of the data and system. For each partition, one Kafka broker acts at any moment as the partition leader: it is the node to which producers write messages and from which consumers read them, while the other replicas follow.
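Partition assignment for keyed messages can be sketched as hashing the key modulo the partition count, which guarantees that all messages with the same key land in the same partition and therefore stay ordered relative to each other. Kafka's default partitioner actually uses murmur2 hashing; CRC32 is used here only to keep the sketch dependency-free.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministic keyed partitioning: same key -> same partition,
    so per-key ordering is preserved. (Real Kafka uses murmur2.)"""
    return zlib.crc32(key) % num_partitions

# All events for a given user hash to one partition and stay ordered.
for key, event in [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "purchase")]:
    print(key.decode(), event, "-> partition", partition_for(key, 3))
```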

What is Presto?

Presto is a distributed query engine that allows the use of ANSI SQL to query data from multiple data sources. It holds intermediate processing state and query results in memory, making it extremely efficient and fast. A Presto cluster consists of a coordinator node, which parses, plans, and schedules queries, and multiple worker nodes, which connect to the data stores via plugins (connectors) and carry out the actual query processing.

Distributed Data Processing Using Kafka and Presto

Kafka and Presto are commonly combined, with Kafka providing real-time data pipelines and Presto providing distributed querying. This is easily achieved using Presto's Kafka connector, which exposes Kafka topics as tables. It is also possible for Presto to act as a producer, sending messages to Kafka that are then processed by other applications such as business intelligence (BI) and machine learning (ML) systems.
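For illustration, once the connector is configured, each topic appears as a table in the Kafka catalog and can be queried with ordinary SQL. The catalog, schema, and topic names below are placeholders; the underscore-prefixed internal columns are provided by the connector for messages without a defined schema.

```sql
-- Hypothetical topic "clicks" exposed as a table by the Kafka connector
SELECT _partition_id, _partition_offset, _message
FROM kafka.default.clicks
LIMIT 10;
```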

To connect Presto and Kafka, you need a running Kafka cluster. You then add a catalog file with the connector.name property set to kafka, a kafka.table-names property listing the topics to expose, and a kafka.nodes property containing the broker node(s). If multiple Kafka clusters are available, each is connected to Presto through its own uniquely named catalog file.
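As a concrete sketch, a catalog file such as etc/catalog/kafka.properties might look like the following; the topic names and broker addresses are placeholders:

```properties
connector.name=kafka
kafka.table-names=clicks,purchases
kafka.nodes=kafka-broker1:9092,kafka-broker2:9092
```

A second cluster would get its own file, for example etc/catalog/kafka-eu.properties, with the same properties pointing at the other cluster's brokers.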

Get Started with Presto & Apache Kafka