Turbocharge Zeppelin with the lightning-fast Presto SQL query engine
What Is Zeppelin?
Zeppelin is an open-source tool for producing data science notebooks that are then stored as .json files.
This kind of analytics software is called a notebook. In Zeppelin, a notebook is called a note, and each note is made up of paragraphs. A paragraph has two parts: an editor where you type your code, free text, or Markdown-formatted annotations, and an output area underneath where the results appear once you run it.
The outputs can be charts, query results, or a Markdown preview. You can arrange the paragraphs of a note into dashboards by stacking them or placing them next to each other.
With Zeppelin, you can not only create data pipelines but also visualize them as graphs to explain your ETL setup to others. If you want to make Zeppelin usable for less technical teams in your organization, such as marketing or sales, you can build dynamic forms that act as query templates. Users only fill in the form and do not need to worry about the syntax or the correct structure of their queries.
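To make the idea of a query template concrete, here is a minimal sketch in plain Python that mimics how a text-input form placeholder of the shape ${name=default} could be filled in. Zeppelin performs this substitution itself when a form is submitted; the fill_template helper, the sales table, and the form names year and max_rows are hypothetical and exist only for illustration.

```python
import re

# Matches text-input placeholders of the form ${name} or ${name=default}.
PLACEHOLDER = re.compile(r"\$\{(\w+)(?:=([^}]*))?\}")


def fill_template(template: str, values: dict) -> str:
    """Replace each placeholder with the user's value, or its default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in values:
            return str(values[name])
        # No user input: fall back to the default, or an empty string.
        return default if default is not None else ""
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical query template; table and column names are assumptions.
TEMPLATE = (
    "SELECT region, SUM(amount) AS revenue\n"
    "FROM sales\n"
    "WHERE year = ${year=2023}\n"
    "GROUP BY region\n"
    "LIMIT ${max_rows=10}"
)

if __name__ == "__main__":
    # The user fills in only the year; max_rows keeps its default.
    print(fill_template(TEMPLATE, {"year": 2022}))
```

The person filling in the form only supplies values such as the year, while the structure of the SQL stays fixed by whoever wrote the template.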
Zeppelin works with a wide range of programming languages that are used in data analytics and data engineering. Among them are SQL, Python, Scala, shell, and markdown. This makes Zeppelin a popular tool for data scientists and engineers working with Apache data processing tools and frameworks.
Zeppelin offers multi-user management for collaborative work on big projects. Companies generate large amounts of structured and unstructured data that often has to be stored in different databases and data lakes. You can pull data from all of them into one notebook using different interpreters. There is no need to create multiple notebooks.
For querying across these data sources from Zeppelin, many have turned to Presto, an open-source distributed query engine.
What Is Presto?
Presto is an open-source distributed SQL query engine that you can connect to many different data sources, such as relational databases, cloud storage, NoSQL databases, and many more. Many use Presto for their data lake analytics needs, as well as for federated querying across a range of data sources.
A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to the various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by adding worker nodes. This has made Presto an efficient choice for organizations with different data formats and sources, a large amount of data to process, or both.
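The effect of adding workers can be illustrated with a toy model. The sketch below is not Presto's scheduler, which is considerably more sophisticated; it simply hands out units of work round-robin to show why the load per worker shrinks as the cluster grows. The assign_splits and busiest_worker_load helpers and the worker names are hypothetical.

```python
from collections import defaultdict


def assign_splits(num_splits: int, workers: list) -> dict:
    """Toy coordinator: hand out units of work to workers round-robin."""
    if not workers:
        raise ValueError("a cluster needs at least one worker")
    assignment = defaultdict(list)
    for split in range(num_splits):
        worker = workers[split % len(workers)]
        assignment[worker].append(split)
    return dict(assignment)


def busiest_worker_load(num_splits: int, workers: list) -> int:
    """Number of units the most heavily loaded worker has to process."""
    assignment = assign_splits(num_splits, workers)
    return max(len(units) for units in assignment.values())


if __name__ == "__main__":
    small_cluster = ["worker-1", "worker-2"]
    large_cluster = ["worker-1", "worker-2", "worker-3", "worker-4"]
    # Same amount of work, spread over more workers.
    print(busiest_worker_load(12, small_cluster))
    print(busiest_worker_load(12, large_cluster))
```

In this simplified picture, doubling the number of workers halves the work each one has to do for the same query.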
How Presto Works With Zeppelin
With Presto, you can query structured and unstructured data at once inside one Zeppelin note.
To use Presto from Zeppelin, you need to add Presto as a new interpreter in Zeppelin. This can be done without any code, because Zeppelin has an interface feature for it: you add a JDBC connector configured for Presto. Then, simply start writing a SQL query in a new note, preceding it with the %jdbc(presto) command to tell the note which interpreter to use.
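As a rough sketch of what this setup might look like, the Python snippet below assembles the kind of properties one would enter in the interpreter settings, together with the text typed into Zeppelin to run a federated query. The property names assume the prefix convention of Zeppelin's JDBC interpreter (presto.driver, presto.url, presto.user), and the driver class assumes the PrestoDB JDBC driver; verify both against the versions you run. The host, port, user, catalogs, schemas, and tables are all hypothetical.

```python
def presto_interpreter_properties(host, port, catalog, schema, user):
    """Properties for a JDBC interpreter entry with the 'presto' prefix.

    Assumption: these mirror the values typed into Zeppelin's
    interpreter settings page; nothing here talks to Zeppelin itself.
    """
    return {
        "presto.driver": "com.facebook.presto.jdbc.PrestoDriver",
        "presto.url": f"jdbc:presto://{host}:{port}/{catalog}/{schema}",
        "presto.user": user,
    }


def presto_note_text(sql):
    """Prefix a SQL query with the directive that selects the interpreter."""
    return "%jdbc(presto)\n" + sql.strip()


# Hypothetical federated query: joins a table in a data lake catalog
# with a table in a relational database catalog.
FEDERATED_QUERY = """
SELECT c.customer_name, COUNT(*) AS page_views
FROM hive.web.page_views AS v
JOIN mysql.crm.customers AS c ON v.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY page_views DESC
LIMIT 10
"""

if __name__ == "__main__":
    props = presto_interpreter_properties(
        "localhost", 8080, "hive", "default", "zeppelin"
    )
    for key, value in props.items():
        print(f"{key} = {value}")
    print()
    print(presto_note_text(FEDERATED_QUERY))
```

Because Presto addresses tables as catalog.schema.table, a single query can reference several sources at once, which is what lets one note combine data from a data lake and a database.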
Since Presto uses SQL, you won’t have to invest time into learning another niche programming language. Thus, speed, scalability, and familiarity make Presto a popular tool for real-time analytics on Zeppelin.