Tutorial: How to run SQL queries with Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on Google BigQuery.

Pretos’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

Step 1: Setup a Presto cluster with Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions or you can use Ahana’s managed service for Presto

Step 2: Setup a Google BigQuery Project with Google Cloud Platform

Create a Google BigQuery project from Google Cloud Console and make sure it’s up and running with dataset and tables as described here.

Below screen shows Google BigQuery project with table “Flights” 

Step 3: Set up a key and download Google BigQuery credential JSON file.

To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format. 

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here

Sample credential file should look like this:

{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x505/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}

Pro-Tip: Before you move to the next step please try to use your downloaded credential JSON file with other third party sql tools like DBeaver to access your BigQuery Table. This is to make sure that your credentials have valid access rights or to isolate any issue with your credentials.

Step 4: Configure Presto Catalog for Google BigQuery Connector

To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. You can create the file with the following contents, replacing the connection properties as appropriate for your setup. This should be done via the edit config map to make sure its reflected in the deployment:

kubectl edit configmap presto-catalog -n <cluster_name> -o yaml

Following are the catalog properties that need to be added:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=patch/for/bigquery-credentials.json

Following are the sample entries for catalog yaml file:

bigquery.properties:  |
connector.name=bigquery
bigquery.project-id=poised-journey-317806
bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json

Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file

To configure the BigQuery connector,

  1. Load the content of credential file as bigquery-credential.json in presto coordinator’s configmap: 

kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml

  1. Add a new session of volumeMounts for the credential file in coordinator’s deployment file: 

    kubectl edit deployment presto-coordinator -n <cluster_name> 

Following the sample configuration, That you can append in your coordinator’s deployment file at the end of volumeMounts section:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-coordinator-etc-vol
  subPath: bigquery-credential.json
  1. Load the content of credential file as bigquery-credential.json in presto worker configmap: 

kubectl edit configmap presto-worker-etc -n <cluster_name>  -o yaml

  1. Add a new session of volumeMounts for the credential file in worker’s deployment file: 

kubectl edit deployment presto-worker -n <cluster_name> 

Following the sample configuration, That you can append in your coordinator’s deployment file at the end of volumeMounts section:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-worker-etc-vol
  subPath: bigquery-credential.json

Step 6: Setup database connection with Apache Superset

Create your own database connection url to query from Superset with below syntax

presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>

Step 7: Check for available datasets, schemas and tables, etc

After successfully database connection with Superset, Run following queries and make sure that the bigquery catalog gets picked up and perform show schemas and show tables to understand available data.  

show catalogs;

show schemas from bigquery;

show tables from bigquery.rohan88;

Step 8: Run SQL query from Apache Superset to access BigQuery table

Once you access your database schema, you can run SQL queries against the tables as shown below. 

select * from catalog.schema.table;

select * from bigquery.rohan88.flights LIMIT1;

You can perform similar queries from Presto Cli as well, here is another example of running sql queries on different Bigquery dataset from Preso Cli. 

$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password

Following example shows how you can join Google BigQuery table with the Hive table from S3 and run sql queries. 

At Ahana, we have made it very simple and user friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running sql queries in a few mins.