Connecting to Presto with Superset

This blog post will provide you with an understanding of how to connect Superset to Presto.

TL;DR

Superset refers to a connection to a distinct data source as a database. A single Presto cluster can connect to multiple data sources by configuring a Presto catalog for each desired data source. Hence, to make a Superset database connection to a particular data source through Presto, you must specify the Presto cluster and catalog in the SQLAlchemy URI as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>.

Superset and SQLAlchemy

Superset is built as a Python Flask web application and leverages SQLAlchemy, a Python SQL toolkit, to provide a consistent abstraction layer to relational data sources. Superset uses a consistent SQLAlchemy URI as a connection string for a defined Superset database. The schema for the URI is as follows: dialect+driver://username:password@host:port/database. We will deconstruct the dialect, driver, and database in the following sections.

SQLAlchemy defines a dialect as the system it uses to communicate with the specifics various databases (e.g. flavor of SQL) and DB-API, low level Python APIs to talk to specific relational data sources. A Python DB-API database driver is required for a given data source. For example, PyHive is a DB-API driver to connect to Presto. It is possible for a single dialect to choose between multiple DB-API drivers. For example, PostgreSQL dialect can support the following DB-API drivers: psycopg2, pg8000, psycop2cffi, an pygresql. Typically, a single DB-API driver is set as the default for a dialect and used when no explicit DB-API is specified. For PostgreSQL, the default DB-API driver is psycopg2.

The term database can be confusing since it is heavily loaded. In a typical scenario a given data source, such as PostgeSQL, have multiple logical groupings of tables which are called “databases”. In a way, these “databases” provide namespaces for tables; identically named tables can exist in two different “databases” without collision. As an example, we can use the PostgreSQL instance available when locally installing Superset with Docker Compose.

In this instance of PostgreSQL, we have four databases: postgres, superset, template0, and template1.

superset@localhost:superset> \\l

+-----------+----------+----------+------------+------------+-----------------------+
| Name      | Owner    | Encoding | Collate    | Ctype      | Access privileges     |
|-----------+----------+----------+------------+------------+-----------------------|
| postgres  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| superset  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| template0 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
| template1 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
+-----------+----------+----------+------------+------------+-----------------------+

We can look into the superset database and see the tables in that database.

The key thing to remember here is that ultimately a Superset database needs to resolve to a collection of tables, whatever that is referred to in a particular dialect.

superset@localhost:superset> \c superset

You are now connected to database "superset" as user "superset"

+--------+----------------------------+-------+----------+
| Schema | Name                       | Type  | Owner    |
|--------+----------------------------+-------+----------|
| public | Clean                      | table | superset |
| public | FCC 2018 Survey            | table | superset |
| public | ab_permission              | table | superset |
| public | ab_permission_view         | table | superset |
| public | ab_permission_view_role    | table | superset |
| public | ab_register_user           | table | superset |
| public | ab_role                    | table | superset |
| public | ab_user                    | table | superset |
| public | ab_user_role               | table | superset |
| public | ab_view_menu               | table | superset |
| public | access_request             | table | superset |
| public | alembic_version            | table | superset |
| public | alert_logs                 | table | superset |
| public | alert_owner                | table | superset |
| public | alerts                     | table | superset |
| public | annotation                 | table | superset |
| public | annotation_layer           | table | superset |
| public | bart_lines                 | table | superset |
| public | birth_france_by_region     | table | superset |
| public | birth_names                | table | superset |
| public | cache_keys                 | table | superset |
| public | channel_members            | table | superset |
| public | channels                   | table | superset |
| public | cleaned_sales_data         | table | superset |
| public | clusters                   | table | superset |
| public | columns                    | table | superset |
| public | covid_vaccines             | table | superset |
:

With an understanding of dialects, drivers, and databases under our belt, let’s solidify it with a few examples. Let’s assume we want to create a Superset database to a PostgreSQL data source and particular PostgreSQL database named mydatabase. Our PostgreSQL data source is hosted at pghost on port 5432 and we will log in as sonny (password is foobar). Here are three SQLAlchemy URIs we could use (actually inspired from the SQLAlchemy documentation):

  1. postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase We explicitly specify the postgresql dialect and psycopg2 driver.
  2. postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase We use the pg8000 driver.
  3. postgresql://sonny:foobar@pghost:5432/mydatabase We do not explicitly list any driver, and hence, SQLAlchemy will use the default driver, which is psycopg2 for postgresql.

Superset lists its recommended Python packages for database drivers in the public documentation.

Presto Catalogs

Because Presto can connect to multiple data sources, when connecting to Presto as a defined Superset database, it’s important to understand what you are actually making a connection to.

In Presto, the equivalent notion of a “database” (i.e. logical collection of tables) is called a schema. Access to a specific schema (“database”) in a data source, is defined in a catalog.

As an example, the listing below is the equivalent catalog configuration to connect to the example mydatabase PostgreSQL database we described previously. If we were querying a table in that catalog directly from Presto, a fully-qualified table would be specified as catalog.schema.table (e.g. select * from catalog.schema.table). Hence, querying the Clean table would be select * from postgresql.mydatabase.Clean.

connector.name=postgresql
connection-url=jdbc:postgresql://pghost:5432/mydatabase
connection-user=sonny
connection-password=foobar

Superset to Presto

Going back to Superset, to create a Superset database to connect to Presto, we specify the Presto dialect. However, because Presto is the intermediary to an underlying data source, such as PostgreSQL, the username and password we need to provide (and authenticate against) is the Presto username and password. Further, we must specify a Presto catalog for the database in the SQLAlchemy URI. From there, Presto—-through its catalog configuration—-authenticates to the backing data source with the appropriate credentials (e.g sonny and foobar ). Hence, the SQLAlchemy URI to connect to Presto in Superset is as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>

The http-server-port refers to the http-server.http.port configuration on the coordinator and workers (see Presto config properties); it is usually set to 8080.

New Superset Database Connection UI

In Superset 1.3, there is a feature-flagged version of a new database connection UI that simplifies connecting to data without constructing the SQLAlchemy URI. The new database connection UI can be turned on in config.py with FORCE_DATABASE_CONNECTIONS_SSL = True (PR #14934). The new UI can also be viewed in the Superset documentation.

Try It Out!

In less than 30 minutes, you can get up and running using Superset with a Presto cluster with Ahana Cloud for Presto. Ahana Cloud for Presto is an easy-to-use fully managed Presto service that also automatically stands up a Superset instance for you. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.