Business Intelligence and Data Analysis with Druid and Presto

Apache Druid Helicopter View

Apache Druid is a distributed, columnar database designed for analytical workloads. It is a real-time analytics database able to ingest and query massive amounts of data quickly while storing that data durably. It was created at Metamarkets in 2011, open-sourced in 2012, and later donated to the Apache Software Foundation. Some of the companies using Druid include PayPal, Cisco, British Telecom (BT), Reddit, Salesforce, Splunk, Unity, and Verizon.

Druid incorporates ideas from data warehousing, cloud computing, and distributed systems. Its architecture provides many characteristics and features that make it a top candidate for an enterprise, real-time data analysis datastore. Druid runs on a cluster of multiple nodes, offering high scalability, concurrency, availability, and fault tolerance.

The Apache Druid Cluster Architecture

Druid nodes come in several types, each serving a specialized function. Real-time nodes read and index streaming data, creating segments that they hold until handing them off to historical nodes. Historical nodes load immutable data segments from deep storage (such as S3 or HDFS) and serve them for queries. Coordinator nodes manage the data lifecycle, handling segment-to-historical-node assignment, load balancing, and replication.

Overlord nodes manage a Druid cluster's tasks and data ingestion. They assign tasks to Middle Manager nodes, which execute them and provide features such as real-time indexing. Broker nodes are the interface between the cluster and its clients: they accept queries, route them to the appropriate real-time or historical nodes, merge the partial results, and return the final result to the client. Druid also offers optional Router nodes, which proxy management requests to the Overlord and Coordinator nodes and route queries to the Broker nodes.
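To make the ingestion path above concrete, the sketch below shows the shape of a Kafka supervisor spec that could be submitted to the Overlord to drive real-time indexing. The topic, column, and server names here are purely illustrative assumptions, not values from this article:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "clickstream",
      "consumerProperties": { "bootstrap.servers": "kafka-host:9092" }
    },
    "dataSchema": {
      "dataSource": "clickstream",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "page"] },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "minute"
      }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Once submitted, the Overlord hands indexing tasks built from this spec to the Middle Managers, and the resulting segments eventually land on historical nodes.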

What is Presto?

Presto is an open source SQL query engine built for data lake analytics and ad hoc queries. It was developed at Facebook to serve OLAP workloads against the company's Hive data lake. Its design goals include fast, parallel query processing; a plugin (connector) architecture that presents disparate datastores as a single virtual data warehouse; and a highly scalable, distributed query engine.

In production, Presto is deployed as a cluster of nodes to improve performance and scalability: a coordinator parses, plans, and schedules queries, while worker nodes execute the query fragments.
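As a rough illustration of such a deployment, a coordinator's config.properties might look like the sketch below. The hostname, port, and memory values are assumptions chosen for the example, not recommendations:

```properties
# etc/config.properties on the coordinator node (values are illustrative)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=4GB
discovery-server.enabled=true
discovery.uri=http://presto-coordinator:8080
```

Worker nodes use a similar file with coordinator=false and the same discovery.uri, so they can register with the coordinator's discovery service.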

Druid and Presto Data Analysis Application

Druid and Presto are usually combined to create highly scalable, parallel, distributed real-time analytics, business intelligence (BI), and online analytical processing (OLAP) solutions. Since both platforms are open source, users can enjoy their power without purchase or licensing costs, provided they are comfortable operating both platforms themselves. With Druid ingesting real-time data and handling ad hoc querying, real-time analytics can be realized on a Presto-powered stack. Presto also lets users run join queries across disparate data sources, so they can select the datastore that best meets each need, e.g. online transaction processing (OLTP) databases like MySQL, document-oriented databases like MongoDB, and/or geospatial databases like PostGIS.
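As a sketch of such a cross-source join, the query below assumes a Presto catalog named druid (backed by the Druid connector) and another named mysql; the schema, table, and column names are hypothetical:

```sql
-- Join real-time event data in Druid with customer data in MySQL
SELECT c.customer_name,
       SUM(e.clicks) AS total_clicks
FROM druid.druid.clickstream AS e
JOIN mysql.crm.customers AS c
  ON e.user_id = c.user_id
GROUP BY c.customer_name
ORDER BY total_clicks DESC
LIMIT 10;
```

Presto plans the query across both connectors, so the analyst writes ordinary SQL without moving data between the systems first.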

Integrating Druid with Presto is done via the Presto Druid connector, which requires creating a catalog properties file that configures the connection. The first property is connector.name, which must be set to druid. The druid.broker-url and druid.coordinator-url properties take the URLs of the broker and the coordinator, respectively, in hostname:port format. Query pushdown is enabled by setting druid.compute-pushdown-enabled to true.
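Putting those properties together, a minimal catalog file might look like the following. The hostnames are placeholders; 8082 and 8081 are Druid's default broker and coordinator ports:

```properties
# etc/catalog/druid.properties (hostnames are illustrative)
connector.name=druid
druid.broker-url=broker-host:8082
druid.coordinator-url=coordinator-host:8081
druid.compute-pushdown-enabled=true
```

After restarting the Presto cluster, Druid tables become queryable through the druid catalog like any other Presto data source.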

Get Started with Presto & Druid