Tutorial: How to run SQL queries with Presto on BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes, handling both interactive queries and batch workloads across multiple data sources. This tutorial shows how to run SQL queries with Presto (running on Kubernetes) against Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. It can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

Step 1: Set up a Presto cluster with Kubernetes

Set up your own Presto cluster on Kubernetes using these instructions, or use Ahana’s managed service for Presto.

Step 2: Set up a Google BigQuery project with Google Cloud Platform

Create a Google BigQuery project from the Google Cloud Console and make sure it’s up and running with a dataset and tables, as described here.
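
If you prefer the command line, you can also create the dataset and table with the bq CLI. The sketch below assumes the Google Cloud SDK is installed and authenticated; the project, dataset, and column names are illustrative, not values from this tutorial:

# Hypothetical names; replace with your own project, dataset, and schema
bq mk --dataset my-project:flights_dataset
bq mk --table my-project:flights_dataset.flights origin:STRING,destination:STRING,departure_time:TIMESTAMP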

[Screenshot: Google BigQuery project with the table “Flights”]

Step 3: Set up a key and download the Google BigQuery credential JSON file

To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format. 

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.
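
If you need to create the key from scratch, a minimal sketch with the gcloud CLI looks like this (the service account name, project ID, and role are placeholders; grant whichever BigQuery role matches your setup):

# Placeholder names; my-project and presto-bigquery-sa are not from this tutorial
gcloud iam service-accounts create presto-bigquery-sa
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:presto-bigquery-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.user"
gcloud iam service-accounts keys create bigquery-credentials.json \
  --iam-account=presto-bigquery-sa@my-project.iam.gserviceaccount.com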

A sample credential file should look like this:

{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x505/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}

Pro tip: Before you move to the next step, try using your downloaded credential JSON file with a third-party SQL tool like DBeaver to access your BigQuery table. This confirms that your credentials have valid access rights and helps isolate any credential issues before they surface in Presto.
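
A quick command-line sanity check also works. This sketch assumes the Google Cloud SDK (gcloud and bq) is installed; the project ID is a placeholder:

# Activate the downloaded key and confirm it can reach BigQuery
gcloud auth activate-service-account --key-file=bigquery-credentials.json
bq --project_id=my-project ls
bq --project_id=my-project query --use_legacy_sql=false 'SELECT 1'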

Step 4: Configure Presto Catalog for Google BigQuery Connector

To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup. On Kubernetes, do this by editing the catalog ConfigMap so that the change is reflected in the deployment:

kubectl edit configmap presto-catalog -n <cluster_name> -o yaml

The following catalog properties need to be added:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=path/to/bigquery-credentials.json

The following are sample entries for the catalog YAML file:

bigquery.properties: |
  connector.name=bigquery
  bigquery.project-id=poised-journey-317806
  bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json
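
For context, here is a sketch of how that entry might sit inside the full presto-catalog ConfigMap; the exact metadata and any other catalog entries depend on your deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: presto-catalog
data:
  bigquery.properties: |
    connector.name=bigquery
    bigquery.project-id=poised-journey-317806
    bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json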

Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file

To configure the BigQuery connector:

  1. Load the content of the credential file as bigquery-credential.json into the Presto coordinator’s ConfigMap (see the ConfigMap sketch after this list):

kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml

  2. Add a new volumeMounts entry for the credential file in the coordinator’s deployment file:

    kubectl edit deployment presto-coordinator -n <cluster_name> 

Here is a sample configuration that you can append at the end of the volumeMounts section of your coordinator’s deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-coordinator-etc-vol
  subPath: bigquery-credential.json
  3. Load the content of the credential file as bigquery-credential.json into the Presto worker’s ConfigMap (the same sketch after this list applies):

kubectl edit configmap presto-worker-etc -n <cluster_name>  -o yaml

  4. Add a new volumeMounts entry for the credential file in the worker’s deployment file:

kubectl edit deployment presto-worker -n <cluster_name> 

Here is a sample configuration that you can append at the end of the volumeMounts section of your worker’s deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-worker-etc-vol
  subPath: bigquery-credential.json
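
For items 1 and 3 above, the credential JSON is embedded verbatim in the ConfigMap. Below is a sketch of the relevant fragment; the surrounding keys in your presto-coordinator-etc and presto-worker-etc ConfigMaps may differ:

data:
  # Paste the full contents of the downloaded key file under this key;
  # the key name must match the subPath used in the volumeMounts above.
  bigquery-credential.json: |
    {
      "type": "service_account",
      "project_id": "poised-journey-315406",
      ...
    }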

Step 6: Set up a database connection with Apache Superset

Create your own database connection URL to query from Superset using the syntax below:

presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>
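
For example, with the bigquery catalog configured earlier and a hypothetical Presto user, the URL would look like this (the host shown is the sample cluster endpoint used in this tutorial):

presto://presto_user:presto_password@bq.rohan1.dev.app:443/bigquery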

Step 7: Check for available datasets, schemas, and tables

After successfully connecting to the database from Superset, run the following queries to make sure the bigquery catalog gets picked up, then use show schemas and show tables to explore the available data.

show catalogs;

show schemas from bigquery;

show tables from bigquery.rohan88;

Step 8: Run SQL query from Apache Superset to access BigQuery table

Once you access your database schema, you can run SQL queries against the tables as shown below. 

select * from catalog.schema.table;

select * from bigquery.rohan88.flights LIMIT 1;

You can run similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password

The following example shows how you can join a Google BigQuery table with a Hive table stored in S3 and run SQL queries across both.

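A sketch of such a query is shown below; the Hive catalog, schema, table, and column names are hypothetical and will differ in your environment, while the BigQuery table is the one used earlier in this tutorial:

-- Hypothetical Hive catalog/schema/table (hive.s3_data.airports); adjust to your setup
SELECT f.origin, a.airport_name, count(*) AS departures
FROM bigquery.rohan88.flights f
JOIN hive.s3_data.airports a
  ON f.origin = a.airport_code
GROUP BY f.origin, a.airport_name
LIMIT 10;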

At Ahana, we have made it simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Related Articles

How do I query a data lake with Presto?

Learn how to get up and running with Presto.

What is Presto (and FAQs about Presto)

This article will explain what Presto is and what it’s used for.