Turbocharge Jupyter with the lightning-fast Presto SQL query engine

What is Jupyter?

Jupyter Notebook is open-source software for creating shareable documents and reports. Each document can host a whole working data pipeline: .ipynb files can contain code cells, plain and Markdown-formatted text, and even visual objects such as charts for data analysis.

Jupyter became particularly popular among data scientists and engineers using Python because it works with most Python packages. A Jupyter Notebook can quickly digest and process significant amounts of data, and installing it takes only a couple of commands in a shell.
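For instance, on a machine that already has Python and pip, the classic Notebook can be installed and launched like this:

```shell
# Install the classic Jupyter Notebook and start the local server;
# it opens in your default browser (by default at http://localhost:8888).
pip install notebook
jupyter notebook
```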

While Jupyter has no drag-and-drop elements, it still offers a few simple no-code features. The software is called a "notebook" because of its interface. In a cell, you type a chunk of code, run it, and see the results printed to a small console underneath. Then you add another chunk and repeat the operation, with the results appended under the second chunk. Since your code and all outputs sit in one place, in chronological order, it is much like taking notes in a paper notebook.

[Image: the Jupyter Notebook interface, from jupyter.org]

Jupyter works well for data analysis or for drafting and testing a data pipeline. If you need some data insights and a report based on them, you can add your comments and annotations directly into the Notebook, run the kernel, and export the whole document as an HTML file to be sent over. You can also build a data pipeline, load and process your data, and export the results to a database, an Excel file, or one of the many other destinations available. Once you are satisfied that the pipeline runs smoothly and delivers adequate results, you can deploy it in your main system.
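The HTML export mentioned above is done with the bundled nbconvert tool; the notebook filename here is a placeholder for your own file:

```shell
# Render a finished notebook (code, outputs, and annotations)
# into a standalone HTML report next to the .ipynb file.
jupyter nbconvert --to html my_notebook.ipynb
```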

Jupyter has a fairly steep learning curve. Fortunately, its contributors have put together thorough documentation to help new users and enable troubleshooting.

What is Presto?

Presto is an open-source SQL query engine for querying data lakes and other data stores in differing formats in a distributed, parallel manner. It was originally developed at Facebook to query the company's data warehouse, which was built on Apache Hive.

The Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.
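The scatter-gather pattern behind this architecture can be sketched in a few lines of Python. This is a toy illustration, not Presto's actual implementation: a "coordinator" splits a job into partitions, "workers" process them in parallel, and the coordinator merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(partition):
    # Each worker computes a partial aggregate over its slice of the data.
    return sum(partition)

def coordinator(data, num_workers=4):
    # The coordinator splits the data into roughly equal partitions...
    size = max(1, len(data) // num_workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # ...fans them out to the workers in parallel...
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = pool.map(worker, partitions)
    # ...and merges the partial results into the final answer.
    return sum(partial_results)

print(coordinator(list(range(1_000_000))))
```

In real Presto the workers also read directly from the underlying data stores, so the coordinator never becomes a bottleneck for raw data.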

Faster Queries and Unified Access to more Data using Jupyter and Presto

You can use Presto with Jupyter Notebook to expand your data pools.

To connect to your Presto engine, you can use the Presto connector from the pyhive package. All Python-based connectors use similar logic. In the case of Presto, you need to enter your host and port, and then add your catalog and schema. Once the connection is established, you can run SQL queries to fetch the data.
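A minimal sketch of such a connection, assuming pyhive is installed (`pip install 'pyhive[presto]'`); the host name, catalog, and schema below are placeholders for your own deployment:

```python
def run_presto_query(sql, host="presto.example.com", port=8080,
                     catalog="hive", schema="default"):
    """Run a SQL query against a Presto coordinator and return all rows.

    host, catalog, and schema are illustrative defaults -- substitute
    the values for your own cluster.
    """
    from pyhive import presto  # imported lazily so the module is optional
    conn = presto.connect(host=host, port=port,
                          catalog=catalog, schema=schema)
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

From here the rows can go straight into a pandas DataFrame or any other structure you already use in your notebook.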

With Presto, you can query both relational and non-relational databases and other data sources. Moreover, you can combine data from more than one data source inside a single query. Presto is also performant: calculations on terabytes of data take minutes. Presto sends one query to multiple worker nodes via the coordinator node, distributing the workload. Once the results are ready, the coordinator collects them and dispatches them back to you.
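Cross-source queries work by prefixing each table with its catalog. The catalog and table names below are hypothetical, chosen only to show the shape of a federated query joining a Hive table with a MySQL table; you would run it through the same pyhive connection as any single-source query, and Presto resolves each catalog prefix to the right connector.

```python
# Hypothetical example: join web page views stored in Hive against
# customer records living in MySQL, all in one Presto query.
FEDERATED_QUERY = """
SELECT c.name, COUNT(*) AS visits
FROM hive.web.page_views AS v
JOIN mysql.crm.customers AS c
  ON v.customer_id = c.id
GROUP BY c.name
ORDER BY visits DESC
LIMIT 10
"""
```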