
Announcing the Cube integration with Ahana: Querying multiple data sources with managed Presto and Cube

See how Ahana and Cube work together to help you set up a Presto cluster and build a single source of truth for metrics without spending days reading cryptic docs

Ahana provides managed Presto clusters running in your AWS account.

Presto is an open-source distributed SQL query engine, originally developed at Facebook, now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.

Presto implements the data federation feature: you can process data from multiple sources as if they were stored in a single database. Because of that, you don’t need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single point of access for multiple databases (or file systems) requires Ops skills and additional effort.

Of course, no data engineer wants to do the Ops work. With Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana hides the burden of infrastructure management and allows you to focus on processing your data.

What is Cube?

Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including Presto, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.


This article will demonstrate the caching functionality, access control, and flexibility of the data retrieval API.

Integration

Cube’s battle-tested Presto driver provides out-of-the-box connectivity to Ahana.

You just need to provide the credentials: the Presto host name and port, user name and password, and the Presto catalog and schema. You’ll also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.
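If you prefer to see these settings as environment variables (for example, in a self-hosted .env file), a minimal sketch could look like the one below; all values are placeholders for your own Ahana cluster details:

CUBEJS_DB_TYPE=prestodb
CUBEJS_DB_HOST=<your-ahana-presto-cluster-host>
CUBEJS_DB_PORT=<port>
CUBEJS_DB_USER=<presto-user>
CUBEJS_DB_PASS=<presto-password>
CUBEJS_DB_PRESTO_CATALOG=<catalog>
CUBEJS_DB_SCHEMA=<schema>
CUBEJS_DB_SSL=true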

Check the docs to learn more about connecting Cube to Ahana.

Example: Parsing logs from multiple data sources with Ahana and Cube

Let’s build a real-world data application with Ahana and Cube.

We will use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with the data retrieved from a PostgreSQL database.

Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you would like to demonstrate the improvements to the clients.

Because of that, you do a canary deployment of the new versions and gather the predictions from both the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. You also track the actual production values provided by your clients.

You need all of that to prepare personalized dashboards showing the results of your hard work.

Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.

You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.

Architecture diagram

In the end, you want a dashboard looking like this:

The final result: two dashboards showing the number of errors made by two variants of the ML model


How to configure Ahana?

Allowing Ahana to access your AWS account

First, let’s log in to Ahana and connect it to our AWS account. We must create an IAM role that allows Ahana to access the account.

On the setup page, click the “Open CloudFormation” button. We are redirected to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.

When the IAM role is configured, click the stack’s Outputs tab and copy the AhanaCloudProvisioningRole key value.

The Outputs tab containing the identifier of the IAM role for Ahana

We have to paste it into the Role ARN field on the Ahana setup page and click the “Complete Setup” button.

The Ahana setup page

Creating an Ahana cluster

After configuring AWS access, we have to start a new Ahana cluster.

In the Ahana dashboard, click the “Create new cluster” button.

Ahana create new cluster

In the setup window, we can configure the type of the AWS EC2 instances used by the cluster, scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, look at the “Create new cluster” section of the Ahana documentation.

Ahana cluster setup page

Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the “Create cluster” button. Ahana needs around 20-30 minutes to set up a new cluster.

Retrieving data from S3 and PostgreSQL

After deploying a Presto cluster, we have to connect our data sources to it because, in this example, the Sagemaker Endpoint logs are stored in S3 and the actual values in a PostgreSQL database.

Adding a PostgreSQL database to Ahana

In the Ahana dashboard, click the “Add new data source” button. We will see a page showing all supported data sources. Let’s click the “Amazon RDS for PostgreSQL” option.

In the setup form displayed below, we have to provide the database configuration and click the “Add data source” button.

PostgreSQL data source configuration

Adding an S3 bucket to Ahana

AWS Sagemaker Endpoints store their logs in an S3 bucket as JSON files. To access those files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.

We have to log in to the AWS console, open the AWS Glue page, and add a new database to the data catalog (or use an existing one).

AWS Glue databases

Now, let’s add a new table. We won’t configure it manually. Instead, let’s create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the “Crawlers” link and click the “Add crawler” button.

AWS Glue crawlers

After typing the crawler’s name and clicking the “Next” button, we will see the Source Type page. On this page, we have to choose “Data stores” and “Crawl all folders” (in our case, “Crawl new folders only” would work too).

Here we specify where the crawler should look for new data

On the “Data store” page, we pick the S3 data store, select the S3 connection (or click the “Add connection” button if we don’t have an S3 connection configured yet), and specify the S3 path.

Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use those parts of the key as table partitions.

Because of that, if our Sagemaker logs have an S3 key: s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!

IAM role configuration

Let’s click the “Next” button. In the subsequent window, we choose “No” when asked whether we want to configure another data source. Glue setup will ask about the name of the crawler’s IAM role. We can create a new one:


Next, we configure the crawler’s schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:

configuring the crawler's schedule

In the output configuration, we need to customize the settings.

First, let’s select the Glue database where the new tables get stored. After that, we modify the “Configuration options.”

We pick “Add new columns only” because we will make manual changes in the table definition, and we don’t want the crawler to overwrite them. Also, we want to add new partitions to the table, so we check the “Update all new and existing partitions with metadata from the table” box.

Crawler's output configuration

Let’s click “Next.” We can check the configuration one more time in the review window and click the “Finish” button.

Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.

AWS Glue tables

In the table view, we click the “Edit table” button and change the “Serde serialization lib” to “org.apache.hive.hcatalog.data.JsonSerDe” because the AWS JSON serialization library isn’t available in the Ahana Presto cluster.

JSON serialization configured in the table details view

We should also click the “Edit schema” button and change the default partition names to values shown in the screenshot below:

Default partition names replaced with their actual names

After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.

Configuring data sources in the Presto cluster

Go back to the Ahana dashboard and click the “Add data source” button. Select the “AWS Glue Data Catalog for Amazon S3” option in the setup form.

AWS Glue data catalog setup in Ahana

Let’s select our AWS region and put the AWS account ID in the “Glue Data Catalog ID” field. After that, we click the “Open CloudFormation” button and apply the template. We will have to wait until CloudFormation creates the IAM role.

When the role is ready, we copy the role ARN from the Outputs tab and paste it into the “Glue/S3 Role ARN” field:

The "Outputs" tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana

On the Ahana setup page, we click the “Add data source” button.

Adding data sources to an existing cluster

Finally, we can add both data sources to our Ahana cluster.

We have to open the Ahana “Clusters” page, click the “Manage” button, and scroll down to the “Data Sources” section. In this section, we click the “Manage data sources” button.

We will see another setup page where we check the boxes next to the data sources we want to configure and click the “Modify cluster” button. We will need to confirm that we want to restart the cluster to make the changes.

Adding data sources to an Ahana cluster

Writing the Presto queries

The actual structure of the input and output of an AWS Sagemaker Endpoint is up to us. We can send any JSON request and return a custom JSON object.

Let’s assume that our endpoint receives a request containing the input data for the machine learning model and a correlation ID. We will need those IDs to join the model predictions with the actual data.

Example input:

{"time_series": [51, 37, …, 7], "correlation_id": "cf8b7b9a-6b8a-45fe-9814-11a4b17c710a"}

In the response, the model returns a JSON object with a single “prediction” key and a decimal value:

{"prediction": 21.266147618448954}

A single request in Sagemaker Endpoint logs looks like this:

{"captureData": {"endpointInput": {"observedContentType": "application/json", "mode": "INPUT", "data": "eyJ0aW1lX3NlcmllcyI6IFs1MS40MjM5MjAzODYxNTAzODUsIDM3LjUwOTk2ODc2MTYwNzM0LCAzNi41NTk4MzI2OTQ0NjAwNTYsIDY0LjAyMTU3MzEyNjYyNDg0LCA2MC4zMjkwMzU2MDgyMjIwODUsIDIyLjk1MDg0MjgxNDg4MzExLCA0NC45MjQxNTU5MTE1MTQyOCwgMzkuMDM1NzA4Mjg4ODc2ODA1LCAyMC44NzQ0Njk2OTM0MzAxMTUsIDQ3Ljc4MzY3MDQ3MjI2MDI1NSwgMzcuNTgxMDYzNzUyNjY5NTE1LCA1OC4xMTc2MzQ5NjE5NDM4OCwgMzYuODgwNzExNTAyNDIxMywgMzkuNzE1Mjg4NTM5NzY5ODksIDUxLjkxMDYxODYyNzg0ODYyLCA0OS40Mzk4MjQwMTQ0NDM2OCwgNDIuODM5OTA5MDIxMDkwMzksIDI3LjYwOTU0MTY5MDYyNzkzLCAzOS44MDczNzU1NDQwODYyOCwgMzUuMTA2OTQ4MzI5NjQwOF0sICJjb3JyZWxhdGlvbl9pZCI6ICJjZjhiN2I5YS02YjhhLTQ1ZmUtOTgxNC0xMWE0YjE3YzcxMGEifQ==", "encoding": "BASE64"}, "endpointOutput": {"observedContentType": "application/json", "mode": "OUTPUT", "data": "eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==", "encoding": "BASE64"}}, "eventMetadata": {"eventId": "b409a948-fbc7-4fa6-8544-c7e85d1b7e21", "inferenceTime": "2022-05-06T10:23:19Z"}

AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:

with sagemaker as (
  select
  model_name,
  variant_name,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
  from s3.sagemaker_logs.logs
)
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)

After that, we join both data sources and calculate the absolute error value:

, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),

Now, we need to calculate the percentiles using the `approx_percentile` function. Note that we group the percentiles by model name and model variant. Because of that, Presto will produce only a single row per model-variant pair. That’ll be important when we write the second part of this query.

percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)

In the final part of the query, we use the filter expression to count the number of values within each bucket. Additionally, we return the bucket boundaries. We need to use an aggregate function such as max because of the group by clause. That won’t affect the result because the previous query returned a single row per model-variant pair.

SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant

How to configure Cube?

In our application, we want to display the distribution of absolute prediction errors.

We will have a chart showing the difference between the actual value and the model’s prediction. Our chart will split the absolute errors into buckets (percentiles) and display the number of errors within every bucket.

If the new variant of the model performs better than the existing model, we should see fewer large errors in the charts. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the “0” label.

At the beginning of the article, we looked at an example chart that shows no significant difference between both model variants:

Example chart: both models perform almost the same

If variant B were better than variant A, its chart could look like this (note the axis values in both pictures):

Example chart: an improved second version of the model

Creating a Cube deployment

Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready-to-use Cube cluster. However, if you prefer self-hosting, then follow this tutorial.

First, let’s create a new Cube Cloud deployment: open the “Deployments” page and click the “Create deployment” button.

Cube Deployments dashboard page

We choose the Presto cluster:

Database connections supported by Cube

Finally, we fill out the connection parameters and click the “Apply” button. Remember to enable the SSL connection!

Presto configuration page

Defining the data model in Cube

We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve query results.

Let’s open the Schema view in Cube and add a new file.

The schema view in Cube showing where we should click to create a new file

In the next window, type the file name errorpercentiles.js and click “Create file.”


In the following paragraphs, we will explain parts of the configuration and show you code fragments to copy-paste. You don’t have to do that in such small steps!

Below, you see the entire content of the file. Later, we explain the configuration parameters.

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

const measures = Object.keys(measureNames).reduce((result, name) => {
  const sqlName = measureNames[name];
  return {
    ...result,
    [sqlName]: {
      sql: () => sqlName,
      type: `max`
    }
  };
}, {});

cube('errorpercentiles', {
  sql: `with sagemaker as (
    select
    model_name,
    variant_name,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
    from s3.sagemaker_logs.logs
  )
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)
, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant`,

preAggregations: {
// Pre-Aggregations definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},

joins: {
},

measures: measures,
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
});

In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.

A newly created cube configuration file

We will group and filter the values by the model and variant names, so we put those columns in the dimensions section of the cube configuration. The rest of the columns are going to be our measures. We can write them out one by one like this:


measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_20: {
    sql: `perc_20`,
    type: `max`
  },
  perc_30: {
    sql: `perc_30`,
    type: `max`
  },
  perc_40: {
    sql: `perc_40`,
    type: `max`
  },
  perc_50: {
    sql: `perc_50`,
    type: `max`
  },
  perc_60: {
    sql: `perc_60`,
    type: `max`
  },
  perc_70: {
    sql: `perc_70`,
    type: `max`
  },
  perc_80: {
    sql: `perc_80`,
    type: `max`
  },
  perc_90: {
    sql: `perc_90`,
    type: `max`
  },
  perc_100: {
    sql: `perc_100`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  perc_20_value: {
    sql: `perc_20_value`,
    type: `max`
  },
  perc_30_value: {
    sql: `perc_30_value`,
    type: `max`
  },
  perc_40_value: {
    sql: `perc_40_value`,
    type: `max`
  },
  perc_50_value: {
    sql: `perc_50_value`,
    type: `max`
  },
  perc_60_value: {
    sql: `perc_60_value`,
    type: `max`
  },
  perc_70_value: {
    sql: `perc_70_value`,
    type: `max`
  },
  perc_80_value: {
    sql: `perc_80_value`,
    type: `max`
  },
  perc_90_value: {
    sql: `perc_90_value`,
    type: `max`
  },
  perc_100_value: {
    sql: `perc_100_value`,
    type: `max`
  }
},
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
A part of the error percentiles configuration in Cube

The notation we have shown you has lots of repetition and is quite verbose. We can shorten the measure definitions by using JavaScript to generate them.

Note that we have to add the following code above the cube function call!

First, we have to create an array of column names:


const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

Now, we must generate the measures configuration object. We iterate over the array and create a measure configuration for every column:


const measures = Object.keys(measureNames).reduce((result, name) => {
  const sqlName = measureNames[name];
  return {
    ...result,
    [sqlName]: {
      sql: () => sqlName,
      type: `max`
    }
  };
}, {});

Finally, we can replace the measure definitions with:

measures: measures

After changing the file content, click the “Save All” button.

The top section of the schema view

Then, click the “Continue” button in the popup window.

The popup window shows the URL of the test API

In the Playground view, we can test our query by retrieving the chart data as a table (or one of the built-in charts):

An example result in the Playground view

Configuring access control in Cube

In the Schema view, open the cube.js file.

We will use the queryRewrite configuration option to allow or disallow access to data.

First, we will reject all API calls without the models field in the securityContext. We will put the identifiers of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token variables.

For example, we can send a JWT token with the following payload. Of course, in the application sending queries to Cube, we must check the user’s access rights and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don’t forget about them.

The Security Context window in the Playground view
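
For reference, a minimal example payload could look like the one below; the model identifiers are placeholders for whatever your authentication service returns:

{
  "models": ["model_a", "model_b"]
}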

After rejecting unauthorized access, we add a filter to all queries.

We can distinguish between the datasets accessed by the user by looking at the data specified in the query. We need to do it because we must filter by the modelName property of the correct table.

In our queryRewrite configuration in the cube.js file, we use the query.filters.push function to add a modelName IN (model_1, model_2, ...) clause to the SQL query:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'errorpercentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};

Configuring caching in Cube

By default, Cube caches all Presto queries for 2 minutes. Even though Sagemaker Endpoints store logs in S3 in near real-time, we aren’t interested in refreshing the data that often. Sagemaker Endpoints store the logs as JSON files, so retrieving the metrics requires a full scan of all files in the S3 bucket.

When we gather logs over a long time, the query may take some time. Below, we will show you how to configure the caching in Cube. We recommend doing it when the end-user application needs over one second to load the data.

For the sake of the example, we will refresh the data only twice a day.

Preparing data sources for caching

First, we must allow Presto to store data in both PostgreSQL and S3. It’s required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading them into Cube.

In PostgreSQL, we grant permissions to the user account used by Presto to access the database:

GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;

If we haven’t modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn’t contain the target S3 location yet, so all requests will fail.

We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.

If you want to use a different database name, follow the instructions in our documentation.


Caching configuration in Cube

Let’s open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:

preAggregations: {
  cacheResults: {
    type: `rollup`,
    measures: [
      errorpercentiles.perc_10, errorpercentiles.perc_10_value,
      errorpercentiles.perc_20, errorpercentiles.perc_20_value,
      errorpercentiles.perc_30, errorpercentiles.perc_30_value,
      errorpercentiles.perc_40, errorpercentiles.perc_40_value,
      errorpercentiles.perc_50, errorpercentiles.perc_50_value,
      errorpercentiles.perc_60, errorpercentiles.perc_60_value,
      errorpercentiles.perc_70, errorpercentiles.perc_70_value,
      errorpercentiles.perc_80, errorpercentiles.perc_80_value,
      errorpercentiles.perc_90, errorpercentiles.perc_90_value,
      errorpercentiles.perc_100, errorpercentiles.perc_100_value
    ],
    dimensions: [errorpercentiles.modelName, errorpercentiles.modelVariant],
    refreshKey: {
      every: `12 hour`,
    },
  },
},

After testing the development version, we can also deploy the changes to production using the “Commit & Push” button. When we click it, we will be asked to type the commit message:

An empty “Commit Changes & Push” view

When we commit the changes, the deployment of a new version of the endpoint will start. A few minutes later, we can start sending queries to the endpoint.

We can also check the pre-aggregations window to verify whether Cube successfully created the cached data.

Successfully cached pre-aggregations

Now, we can move to the Playground tab and run our query. We should see the “Query was accelerated with pre-aggregation” message if Cube used the cached values to handle the request.

The message that indicates that our pre-aggregation works correctly

Building the front-end application

Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.

Our dashboard consists of two parts: the website and the back-end service. In the web part, we will have only the code required to display the charts. In the back-end, we will handle authentication and authorization. The backend service will also send requests to the Cube REST API.

Getting the Cube API key and the API URL

Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud’s web UI and click the “Env vars” tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to CUBEJS_API_SECRET and copy the value.

The Env vars tab on the settings page

We also need the URL of the Cube endpoint. To get this value, click the “Copy API URL” link in the top right corner of the screen.

The location of the Copy API URL link

Back end for front end

Now, we can write the back-end code.

First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect those model names in an array stored in the allowedModels variable.

After getting the user’s credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.

const jwt = require('jsonwebtoken');
const CUBE_URL = ''; // put the Cube API URL retrieved in the previous step here
function create_cube_token() {
  const CUBE_API_SECRET = your_token; // Don’t store it in the code!!!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system

  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );
  
  return cubejsToken;
}

We will need two endpoints in our back-end service: the endpoint returning the chart data and the endpoint retrieving the names of models and variants we can access.

We create a new Express application running in a Node.js server and configure the /models endpoint:

const request = require('request');
const express = require('express')
const bodyParser = require('body-parser')
const port = 5000;
const app = express()

app.use(bodyParser.json())
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`)
})

In the getAvailableModels function, we query the Cube Cloud API to get the model names and variants. Because we have configured the Cube security context, it will return only the models we are allowed to see. The function returns a list of objects containing the modelName and modelVariant fields:

function getAvailableModels(req, res) {
  res.setHeader('Content-Type', 'application/json');
  request.post(CUBE_URL + '/load', {
    headers: {
      'Authorization': create_cube_token(),
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({"query": {
      "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
      ],
      "timeDimensions": [],
      "order": {
        "errorpercentiles.modelName": "asc"
      }
    }})
  }, (err, res_, body) => {
    if (err) {
      console.log(err);
    }
    body = JSON.parse(body);
    response = body.data.map(item => {
      return {
        modelName: item["errorpercentiles.modelName"],
        modelVariant: item["errorpercentiles.modelVariant"]
      }
    });
    res.send(JSON.stringify(response));
  });
};

Let’s retrieve the percentiles and percentile buckets. To simplify the example, we will show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.

The query specifies all measures we want to retrieve and sets the filter to get data belonging to a single model’s variant. We could retrieve all data at once, but we do it one by one for every variant.

{
  "query": {
    "measures": [
      "errorpercentiles.perc_10",
      "errorpercentiles.perc_20",
      "errorpercentiles.perc_30",
      "errorpercentiles.perc_40",
      "errorpercentiles.perc_50",
      "errorpercentiles.perc_60",
      "errorpercentiles.perc_70",
      "errorpercentiles.perc_80",
      "errorpercentiles.perc_90",
      "errorpercentiles.perc_100",
      "errorpercentiles.perc_10_value",
      "errorpercentiles.perc_20_value",
      "errorpercentiles.perc_30_value",
      "errorpercentiles.perc_40_value",
      "errorpercentiles.perc_50_value",
      "errorpercentiles.perc_60_value",
      "errorpercentiles.perc_70_value",
      "errorpercentiles.perc_80_value",
      "errorpercentiles.perc_90_value",
      "errorpercentiles.perc_100_value"
    ],
    "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
    ],
    "filters": [
      {
        "member": "errorpercentiles.modelName",
        "operator": "equals",
        "values": [
          req.query.model
        ]
      },
      {
        "member": "errorpercentiles.modelVariant",
        "operator": "equals",
        "values": [
          req.query.variant
        ]
      }
    ]
  }
}

The response parsing code extracts the number of values in every bucket and prepares bucket labels:

response = body.data.map(item => {
  return {
    modelName: item["errorpercentiles.modelName"],
    modelVariant: item["errorpercentiles.modelVariant"],
    labels: [
      "<=" + item['errorpercentiles.perc_10_value'],
      item['errorpercentiles.perc_20_value'],
      item['errorpercentiles.perc_30_value'],
      item['errorpercentiles.perc_40_value'],
      item['errorpercentiles.perc_50_value'],
      item['errorpercentiles.perc_60_value'],
      item['errorpercentiles.perc_70_value'],
      item['errorpercentiles.perc_80_value'],
      item['errorpercentiles.perc_90_value'],
      ">=" + item['errorpercentiles.perc_100_value']
    ],
    values: [
      item['errorpercentiles.perc_10'],
      item['errorpercentiles.perc_20'],
      item['errorpercentiles.perc_30'],
      item['errorpercentiles.perc_40'],
      item['errorpercentiles.perc_50'],
      item['errorpercentiles.perc_60'],
      item['errorpercentiles.perc_70'],
      item['errorpercentiles.perc_80'],
      item['errorpercentiles.perc_90'],
      item['errorpercentiles.perc_100']
    ]
  }
})

Dashboard website

In the last step, we build the dashboard website using Vue.js.

If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.

We define the main Vue component encapsulating the entire website content. In the script section, we will download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for all of them.

We put the charts in the Suspense component to allow asynchronous loading.

To keep the example short, we will skip the CSS style part.

<script setup>
  import OwnerName from './components/OwnerName.vue'
  import ChartView from './components/ChartView.vue'
  import axios from 'axios'
  import { ref } from 'vue'
  const models = ref([]);
  axios.get(SERVER_URL + '/models').then(response => {
    models.value = response.data
  });
</script>

<template>
  <header>
    <div class="wrapper">
      <OwnerName name="Test Inc." />
    </div>
  </header>
  <main>
    <div v-for="model in models" v-bind:key="model.modelName">
      <Suspense>
        <ChartView v-bind:title="model.modelName" v-bind:variant="model.modelVariant" type="percentiles"/>
      </Suspense>
    </div>
  </main>
</template>

The OwnerName component displays our client’s name. We will skip its code as it’s irrelevant in our example.

In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:

import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);

We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain those properties:

const props = defineProps({
  title: String,
  variant: String,
  type: String
})

Next, we retrieve the chart data and labels from the back-end service. We will also prepare the variable containing the label text:

const response = await axios.get(SERVER_URL + '/' + props.type + '?model=' + props.title + '&variant=' + props.variant)
const data = response.data[0].values;
const labels = response.data[0].labels;
const label_text = "Number of prediction errors of a given value"

Finally, we prepare the chart configuration variables:

const chartData = ref({
  labels: labels,
  datasets: [
    {
      label: label_text,
      backgroundColor: '#f87979',
      data: data
    }
  ],
});

const chartOptions = {
  plugins: {
    title: {
      display: true,
      text: props.title + ' - ' + props.variant,
    },
    legend: {
      display: false
    },
    tooltip: {
      enabled: false
    }
  }
}

In the template section of the Vue component, we pass the configuration to the Bar instance:

<template>
  <Bar ref="chart" v-bind:chart-data="chartData" v-bind:chart-options="chartOptions" />
</template>

If we have done everything correctly, we should see a dashboard page with error distributions.

Charts displaying the error distribution for different model variants

Wrapping up

Thanks for following this tutorial.

We encourage you to spend some time reading the Cube and Ahana documentation.

Please don’t hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube’s Slack community, and subscribe to the Ahana newsletter.

Streaming Data Processing Using Apache Kafka and Presto

Kafka Quick Start


Kafka is a distributed data streaming framework meant to enable the creation of highly scalable distributed systems. Developed at LinkedIn in 2008 and open-sourced in 2011, it was created to build decoupled yet conceptually connected systems. Broken down to the simplest level, Kafka provides a consistent, fast, and highly scalable log. Specifically, it is a commit log whereby all writes are guaranteed to be ordered and entries cannot be deleted or modified.

Once entries are added to the log, different systems can then process the entries, communicating with each other as needed, most likely by adding entries to the Kafka commit log. This enables the creation of software as a system of systems. Communication and processing happen in parallel and asynchronously, enabling each system to be developed, maintained, scaled, and enhanced as needed. Some of the companies using Kafka include Coursera, Netflix, Spotify, Activision, Uber, and Slack. 

Inner Workings Of Kafka

Kafka consists of producers that send messages to a Kafka node. These messages are grouped by topics to which consumers are subscribed. Each consumer receives all the messages sent to the topics it is subscribed to and carries out further processing as required. All the messages sent to the broker are stored for a given time or until they reach a given size on disk. Deployment is done in a cluster consisting of several brokers to ensure there is no single point of failure.

Messages sent to topics are split into partitions that are replicated across several nodes. The replication factor is determined by the performance and resilience requirements of the data/system. At any moment, one Kafka broker acts as the partition leader that owns the partition. It is the node to which producers write their messages and from which consumers read them.

What is Presto?


Presto is a distributed query engine that allows the use of ANSI SQL to query data from multiple data sources. It holds processing and query results in memory, making it extremely efficient and fast. A Presto cluster consists of a coordinator node and multiple worker nodes. The worker nodes are responsible for connecting to data stores via plugins/connectors and for query processing.

Distributed Data Processing Using Kafka and Presto

Kafka and Presto are normally combined, with Kafka providing real-time data pipelines and Presto providing distributed querying. This is easily achieved using the Presto Kafka connector, which provides access to Kafka topics. It is also possible to have Presto act as the producer, sending messages to Kafka that are processed by other applications like business intelligence (BI) and machine learning (ML) systems.


To connect Presto and Kafka, you need to have the Kafka cluster running. You then add a catalog file with the connector.name value set to kafka, the kafka.table-names property, which lists the topics from the cluster, and the kafka.nodes property, which contains the node(s). If multiple Kafka clusters are available, connecting them to Presto is achieved by adding a uniquely named catalog file for each cluster.
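
As a sketch, such a Kafka catalog file (for example, etc/catalog/kafka.properties) could look like the following; the topic names and broker addresses are placeholders for your own cluster:

connector.name=kafka
kafka.table-names=orders,clickstream
kafka.nodes=kafka-broker-1:9092,kafka-broker-2:9092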

Get Started with Presto & Apache Kafka

Business Intelligence And Data Analysis With Druid and Presto

Apache Druid Helicopter View


Apache Druid is a distributed, columnar database aimed at developing analytical solutions. It offers a real-time analytics database able to ingest and query massive amounts of data quickly and store the data safely. It was developed by Metamarkets in 2011, open-sourced in 2012, and made an Apache project in 2015. Some of the companies using Druid include Paypal, Cisco, British Telecom (BT), Reddit, Salesforce, Splunk, Unity, and Verizon.

Druid incorporates ideas from data warehousing, cloud computing, and distributed systems. Its architecture provides many characteristics and features that make it a top candidate for an enterprise, real-time data analysis datastore. Druid runs on a cluster of multiple nodes, offering high scalability, concurrency, availability, and fault tolerance.

The Apache Druid Cluster Architecture

Druid nodes are of various types, each serving a specialized function. Real-time nodes read and index streaming data, creating segments that are stored until forwarded to historical nodes. Historical nodes store and read immutable data segments kept in deep storage like S3 and HDFS. Coordinator nodes handle data management, taking care of segment-to-historical-node assignment, load balancing, and replication.

Overlord nodes handle a Druid cluster’s task and data ingestion management. They assign tasks to Middle Manager nodes that process them and provide features like real-time indexing. Broker nodes provide an interface between the cluster and clients: they accept queries, send them to the appropriate real-time/historical nodes, collect the query results, and return the final results to the client. Druid also has optional Router nodes providing proxying services for request management to Overlord and Coordinator nodes and query routing services to Broker nodes.

What is Presto?


Presto is an open source SQL query engine built for data lake analytics and ad hoc queries. It was developed to meet Facebook’s OLAP needs against their Hive data lake. Its design goals include fast and parallel query processing, creating a virtual data warehouse from disparate datastores via a plugin architecture, and being a highly scalable and distributed query engine.

Presto is deployed in production as a cluster of nodes to improve performance and scalability.

Druid and Presto Data Analysis Application

Druid and Presto are usually combined to create highly scalable, parallel, distributed real-time analytics, business intelligence (BI), and online analytical processing (OLAP) solutions. Since both platforms are open source, users can enjoy their power without investing in purchase/licensing costs if they’re OK with managing both on their own. With Druid processing real-time data and handling ad-hoc querying, real-time analytics can be realized on a Presto-powered stack. Presto allows users to perform join queries across disparate data sources. Therefore, they can select the datastore that best meets their diverse needs, e.g., online transaction processing (OLTP) databases like MySQL, document-oriented databases like MongoDB, and/or geospatial databases like PostGIS.


Integrating Druid with Presto is done via the Presto Druid connector. This requires the creation of a catalog properties file that configures the connection. The first property is connector.name, which needs to be set to druid. The druid.broker-url and druid.coordinator-url properties accept the URLs of the broker and the coordinator, respectively, in the hostname:port format. Query pushdown is enabled by setting druid.compute-pushdown-enabled to true.
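
Putting those properties together, a Druid catalog file (for example, etc/catalog/druid.properties) could look like the sketch below; the hostnames and ports are placeholders for your own cluster:

connector.name=druid
druid.broker-url=druid-broker-host:8082
druid.coordinator-url=druid-coordinator-host:8081
druid.compute-pushdown-enabled=true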

Get Started with Presto & Druid

Flexible and Low Latency OLAP Using Apache Pinot and Presto for Real-Time Analytics

Apache Pinot Overview


Apache Pinot is a distributed, low latency online analytical processing (OLAP) platform used for carrying out fast big data analytics. Developed at LinkedIn in 2014, the highly scalable platform is meant to power time-sensitive analytics and is designed to have low latency and high throughput. It was open-sourced in 2015 and incubated by the Apache Software Foundation in 2018. Some of its use cases include high dimensional data analysis, business intelligence (BI), and providing users with profile view metrics. Other companies using Pinot include Uber, Microsoft, Target, Stripe, and Walmart.

Simplified View Of How Apache Pinot Works

Pinot is meant to be highly scalable and distributed while providing high throughput and fast turnaround time. To achieve this, related data from streaming sources like Kafka and data lakes like S3 are stored in tables. The tables are split into segments that are sets containing non-changing tuples. Segments are stored in a columnar manner and additionally contain metadata, zone maps, and indices related to contained tuples. Segments are stored and replicated among Pinot server nodes. Controller nodes contain global metadata related to all segments in a cluster like server node to segment mapping.


Pinot consists of four main components, namely brokers, servers, minions, and controllers. The controller handles cluster management, scheduling, and resource allocation, and provides a REST API for administration. The Pinot broker is responsible for receiving client queries, sending them to servers for execution, and returning the results to the client. Servers host the segments that store data and handle most of the distributed processing. They are divided into offline and real-time servers, with offline servers typically containing immutable segments and real-time servers ingesting data from streaming sources. Minions are used for maintenance tasks not related to query processing, like periodically purging data from a Pinot cluster for security and regulatory compliance reasons.

What is Presto?

Presto is a fast, open source, distributed SQL query engine able to handle processing in a parallel and distributed manner.

Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Fast and Flexible OLAP With Pinot and Presto

When carrying out analytics, system designers and developers normally have to make a tradeoff between querying flexibility and fast response times. The more flexible a system is, the slower its response time. Pinot is extremely fast but has limited flexibility, while Presto is a bit slower but offers more flexibility. Having a Pinot cluster as the storage layer and a Presto cluster as the querying layer provides users with high throughput, low latency storage and powerful, flexible querying. Integration is achieved using an open source Presto Pinot connector that is responsible for managing connections and mapping queries and their results between the two platforms. Queries are optimized by pushing them down to Pinot, with Presto offering features lacking in Pinot, like table joins.
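
A minimal Pinot catalog file could look like the sketch below; the pinot.controller-urls property name follows the Presto Pinot connector documentation (check the docs for your Presto version), and the host and port are placeholders:

connector.name=pinot
pinot.controller-urls=pinot-controller-host:9000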


You can learn more about the Apache Pinot connector for Presto in the PrestoCon session presented by the Apache Pinot team.

Get Started with Apache Pinot & Presto

Turbocharge your Analytics with MongoDB And Presto

High-Level View Of MongoDB


MongoDB is a NoSQL distributed document database meant to handle diverse data management requirements. Its design goals include creating an object-oriented, highly available, scalable, and efficient database with ACID (Atomicity, Consistency, Isolation, and Durability) features. Its document model enables data to be stored in its most natural form as opposed to the relational model, making users more productive. It supports both schemaless and schema-based design, offering flexibility as well as data integrity and consistency enforcement as needed. Some of the organizations using MongoDB include Google, SAP, Verizon, Intuit, Sega, Adobe, InVision, and EA Sports.

A Look At MongoDB Architecture

MongoDB stores data in documents in the Binary JSON (BSON) format. Logically related documents are grouped into collections that are indexed. MongoDB servers that store data form shards, which are grouped into replica sets. Replica sets have the same data replicated among them, with the default replication factor being 3 servers. Data is partitioned into chunks, which, combined with sharding and replication, provides high reliability and availability. During partitioning, consistency is ensured by making the database unavailable for writes. Config servers hold configuration data and metadata related to the MongoDB clusters. Mongo’s routers accept queries, direct them to the correct shards, and return results to clients.

MongoDB Deployment

MongoDB is cross-platform and can be installed on all major operating systems. It can either be installed manually, deployed on private and/or public clouds, or accessed via premium cloud offerings. Recommended practice in production is to have multiple nodes running MongoDB instances, forming a cluster.

What is Presto?


Presto is an open source SQL query engine that provides a scalable and high throughput query engine capable of accessing different data stores including MySQL, DB2, Oracle, Cassandra, Redis, S3, and MongoDB. This enables the creation of a virtualized data lake of all data. Combining Presto with MongoDB creates a highly scalable and cohesive yet loosely decoupled data management stack.

Scalable Analytics With MongoDB and Presto


Combining MongoDB and Presto provides a highly scalable tech stack for developing distributed analytical applications. MongoDB is an enterprise distributed database capable of storing data as strictly as users need it to be while ensuring high horizontal scalability, availability, resilience, and self-healing. Designers and developers can choose the data model that best serves them, trading flexibility for strictness in the schema design and performance for transactional integrity in write operations. Different clusters can be created as needed to meet different performance and functional goals.

For example, writes can be unacknowledged, acknowledged, or replica-acknowledged, with faster writes being achieved with weaker write enforcement. Reads can be performed from secondary, primary-preferred, or primary nodes for a tradeoff between turnaround times and the risk of fetching stale data. This makes it a great storage layer for OLAP systems. Data can be persisted as accurately, or read as quickly, as each application needs. Integration is achieved using the Presto MongoDB connector.
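
As a rough sketch, a MongoDB catalog file could look like the following; mongodb.seeds is the connector property listing the MongoDB servers, and the hostnames and port are placeholders:

connector.name=mongodb
mongodb.seeds=mongo-host-1:27017,mongo-host-2:27017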

Can you insert a JSON document into MongoDB with Presto?

This question comes up quite a bit. In short, yes, you can do this. You’d run an INSERT statement from Presto, treating the MongoDB collection as a table. For example:

INSERT INTO orders VALUES(1, 'bad', 50.0, current_date);

That insert would go into MongoDB as a JSON document.
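
The exact field names depend on how the orders table is defined; assuming hypothetical columns named orderkey, status, totalprice, and orderdate, the stored document would look roughly like this (with orderdate set to the insert date):

{"orderkey": 1, "status": "bad", "totalprice": 50.0, "orderdate": "2022-05-06"}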

Getting started with Presto in the cloud

If you want to get started with Presto quickly in the cloud, try out Ahana Cloud for free. Ahana takes care of the deployment, management, adding/detaching data sources, etc. for you. It’s a managed service for Presto that makes it really easy to get started. You can try it free at https://ahana.io/sign-up 

Utilizing SiSense and Presto For Distributed Data Analysis

Data Analysis Using SiSense


SiSense is a full-fledged data analysis platform that provides actionable analytics to its users for data-driven management. As an end-to-end platform, it makes leveraging disparate sources of data, analyzing that data, and presenting the results to users in an easy-to-use manner seamless and simple. Some of the data sources it supports are data files, MySQL, Oracle, Salesforce, and BigQuery. By integrating different data stores, the platform can also overcome the problem of data silos and provide a holistic view of available data. Some of the organizations using SiSense include Nasdaq, Gitlab, Phillips, and Tinder.

Understanding How SiSense Works

SiSense is a web-based data analytics platform whereby servers are used for data ingestion, processing, and user interaction. The SiSense Server can be installed on a single machine or deployed in a cluster. Connecting to the data sources is handled by the ElastiCube Data Hub in two modes, self-managed Live Models or proprietary, super-fast ElastiCube Models. The ElastiCube Server and Application server handle data management and processing. The Web Server provides business users a way to interact with SiSense via a web app, mobile app, and REST API. Client apps handle various essential tasks like data source management, server management, distributed messaging, and node orchestration.

image from sisense.com

SiSense provides users with different ways of utilizing it for BI, including manual installation and/or cloud deployment. Data engineers handle data connections and management to create data models ready for analysis. Data developers use the models to carry out ad-hoc analysis, develop custom BI solutions, and create UI artifacts to be presented to business users.

What is Presto?

presto logo

Presto was created by Facebook to handle the huge amounts of data it generates every minute. Like Hadoop, it can carry out distributed and parallel processing over numerous nodes. However, rather than writing intermediate results to disk, it holds them in memory. This allows for data processing in a matter of seconds instead of the hours or days common in Hadoop jobs.

Distributed Data Analysis Using SiSense and Presto

The recommended way of using SiSense for big data and near real-time data analysis is a multi-node deployment. This allows for optimal performance and efficiency, especially when using ElastiCube models. Combining SiSense with Presto makes it easier to build highly scalable BI solutions with better performance.

Users benefit by having two distributed systems that are each highly optimized for the tasks they need to handle. Using Presto for data management provides access to more data stores and keeps that work in a dedicated, easily separable cluster. Presto is highly optimized for data management and query execution thanks to its parallel and distributed architecture, and it simplifies data management by offering a single source of truth: a virtual data warehouse which SiSense can use as its data source. SiSense then handles BI and analysis.


Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge SiSense in 30 minutes!

Get Started with Presto & SiSense

Collaborative Data Science Using Mode SQL And Presto

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Mode SQL.

Mode SQL In A Minute

Mode logo

Mode is a modern collaborative data science platform meant to help data scientists with diverse skillsets work effectively both individually and as a team. By supporting the use of SQL, Python, and R, Mode SQL enables users to use the language they are most comfortable working with or combine them as needed. Support for multiple users helps teams collaborate on projects and increases productivity. Created by Mode Analytics, it is an online platform meant to offer simplicity, power, and versatility to data scientists and analysts. This has resulted in more than 50% of the Fortune 500 companies using the platform. Some firms using Mode include VMWare, Bloomberg, Everlane, InVision, and Reddit.

Mode SQL

Among the products the platform offers is the Mode SQL Editor, which uses SQL to carry out data analysis online and to share the results with business users. To use the platform, one registers either for the free plan, which supports five users, or for the business or enterprise plans, which offer more features, handle larger volumes of data, and support more users. Users then provide connections to the data stores they wish to use from the various databases supported.

slide reports
image from mode.com

Data scientists and analysts then create definitions and run queries against the connected database. Advanced logic like loops and conditional statements can be added through Liquid. The results are used to create reports and dashboards using the Mode chart builder or piped to notebooks for further analysis. Workspaces, member management, and report management can be handled programmatically through APIs. Combined with support for multiple databases, these features greatly simplify the data science workflow. Some of the data stores Mode SQL supports include MariaDB, PostgreSQL, TimescaleDB, Snowflake, Vertica, Redshift, and Presto.

Why Use Presto

logo presto

Using Presto increases the performance of the developed solutions due to its massively parallel processing (MPP) architecture. It enables users to process big data efficiently and improves response times. The use of SQL greatly reduces the entry barrier for beginners. This enables users to carry out analysis and gain valuable insights from the data faster, improving productivity and performance.

Data Science With Mode SQL and Presto

The Mode platform can be combined with the Presto query engine to provide users with a high-performance stack for developing custom data science solutions. Presto is able to federate many databases, providing both a virtualized data lake and access to more data sources than Mode connects to natively. Users therefore have access to all their data in a centralized place, and because Presto speaks SQL, they can connect to it directly from Mode SQL. Query results can then be processed on the platform with Liquid, Python, and/or R, which offer features absent in SQL.

To use Mode SQL with Presto, one needs to have a Presto server or cluster running. One then connects Mode to the cluster and writes SQL queries against it. The results of the queries are saved and used to build dashboards and reports to be accessed by end users.


Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Mode in 30 minutes!

Get Started with Presto & Mode SQL

using apache superset

Turbocharge your BI and Analytics Using Apache Superset and Presto

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Apache Superset.

Interactive, ad hoc queries and lightning fast data visualizations for Apache Superset

Using Apache superset

What is Apache Superset? Apache Superset is a big data exploration and visualization platform that was created at an Airbnb hackathon in 2015. Since then, it has grown into an enterprise-level, open-source business intelligence (BI) system offering features found in commercial solutions like Looker and Tableau. It is a highly customizable BI system, able to visualize and explore huge amounts of data and is meant for users of all skills. Some of the companies using Superset include Ubisoft, Dropbox, Yahoo, and American Express.

Using Superset, one can create visualizations and perform data exploration either through a GUI-based approach or with SQL, allowing ease of use for both casual and technical users. It allows for near real-time analysis of huge amounts of data and supports geospatial data analytics. Superset can connect to many different SQL-based databases to enable easy creation of charts, interactive dashboards, geospatial visualizations, and data exploration using SQL or drag-and-drop, among other features. Some of the databases supported include PostgreSQL, Oracle, Redshift, BigQuery, DB2, and Presto (not technically a database).

Anatomy of Apache Superset

Superset consists of a Python-based backend and a JavaScript frontend. The backend handles connecting to different databases and carrying out data processing, while the frontend handles data visualization and user interactions. The backend uses different connectors and SQLAlchemy to connect to different databases, Pandas for data analysis, and Flask for the web server. Optional visualization caching is supported using software like Memcached and Redis. A React/Redux architecture is used to create the frontend.

While Superset can use direct connections to individual supported data stores, another way is to connect Superset to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads, and instant, seamless access to multiple data sources. With Superset + Presto, a distributed SQL query engine, it can become an even more powerful tool.

What is Presto?

logo presto

Just like Superset, Presto was created for developing big data solutions. Presto can connect to more than just SQL-based data sources and is built for distributed and parallel data querying. It then exposes these data sources, even S3-based data lakes, as if they were SQL-compliant databases. Using Superset with Presto enables the creation of a highly decoupled and scalable BI solution.

Presto is an open source SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. 

Faster Queries and Unified Access to more Data with Open Source Apache Superset and Presto

apache superset logo

+

logo presto

Presto is one of the data sources that Superset supports. By combining them, developers and organizations can leverage the features of two open source, distributed, and highly scalable systems to meet their BI needs. A Superset cluster can be used to carry out data visualization and exploration while a Presto cluster can be used to connect to disparate data sources. This allows BI to be carried out using non-SQL data sources like HDFS and NoSQL databases.

Connecting them is usually done using pyhive, which provides SQLAlchemy dialects for both Hive and Presto; it is easily installed with the command pip install pyhive. One then provides the connection URI for the Presto data source, for example in the format presto://user@host:port/catalog/schema. Custom BI solutions can be created using SQL that is passed to Presto, which handles the actual query processing. Query results are returned to Superset for data analysis and visualization.
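
As a quick way to verify those connection details outside of Superset, here is a minimal sketch using SQLAlchemy with the pyhive Presto dialect; the user, host, port, catalog, and schema are hypothetical placeholders:

# pip install 'pyhive[presto]' sqlalchemy
from sqlalchemy import create_engine, text

engine = create_engine("presto://analyst@presto.example.com:8080/hive/default")

with engine.connect() as connection:
    # List the tables Presto exposes in the configured catalog and schema
    for row in connection.execute(text("SHOW TABLES")):
        print(row)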

If you want to get started with Presto, you can check out Ahana Cloud. It’s SaaS for Presto and offers a fully managed service for Presto in the cloud. Ahana Cloud comes pre-integrated with Apache Superset.


With a few clicks you can add Superset to your Presto cluster.


Turbocharge Business Intelligence Using Redash and Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Redash.

Interactive, ad hoc queries and faster data visualizations for Redash

redash logo

What is Redash? Redash is an open source business intelligence (BI) system that uses SQL for data analysis and visualization. It enables the collaborative creation of visualizations and dashboards from different data sources using SQL. The data sources are not restricted to SQL conformant databases but include NoSQL databases like MongoDB and object storage systems like AWS S3. The visualizations, dashboards, reusable snippets, and other Redash artifacts are then shared either internally or publicly with other users to provide insight and support data-driven decision-making. This allows organizations and individuals to strategize and make decisions based on the latest data available to them. Some enterprise companies using Redash include Mozilla, SoundCloud, Atlassian, and Cloudflare.

Advantages of Using Redash

There are many advantages of using Redash for data analytics and visualization. Being open source, it can be installed for free and modified as per each specific use case with no restrictions. It can be installed and hosted free manually or accessed via premium cloud offerings, preventing vendor lock-in. The use of SQL for carrying out data processing makes it accessible to many users who already know SQL. Redash’s web-based interface and collaborative nature make it easy to create, share and use BI artifacts across diverse teams and departments.

redash dashboard
image from redash.io

Redash Internals

Redash consists of a JavaScript frontend and a Python backend. The frontend is based on React, while older versions are based on AngularJS. The frontend handles user interaction and contains the SQL editor, dashboards, and visualizations. The backend consists of a Flask web server that provides a REST API and a PostgreSQL database that handles data caching. Queries are handled by Celery workers responsible for the actual data processing and for connecting to the various data sources. Redash is packaged as a web-based single-page application (SPA) designed for high scalability and availability.

One way Redash is used is via direct connections to individual supported data stores. The other way is to connect Redash to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads and instant, seamless access to multiple data sources. With Redash + Presto, a distributed SQL query engine, it can become an even more powerful tool.

What is Presto?

logo presto

Presto is a massively parallel processing (MPP) query engine meant to interact with different data sources and process data extremely fast. This is achieved by processing data and holding intermediate results in memory rather than writing them to disk as Hadoop does. Using Redash and Presto enables the separation of the BI and querying systems, so different people can use, optimize, and manage them independently.

How Redash works with Presto

redash logo

+

logo presto

Combining Redash and Presto allows the development of free and distributed BI systems as both of them are open source. Typically, this is achieved by having a frontend Redash cluster that communicates with a backend Presto cluster. The backend handles query processing and data management while the frontend provides the user interface. 

Ahana Cloud customer Cartona uses Redash with Presto and Ahana Cloud to power its dashboards. Learn more about their use case in their presentation.

PrestoCon Day 2020 talk by eCommerce company Cartona

Having Presto as the query engine provides better performance and access to more data sources via the Presto connectors.

The architecture consists of Redash connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries, while Redash handles the visualization of reports and dashboards. One configures Redash to connect to the Presto cluster by choosing Presto as the data source type and providing properties like host, port, and catalog. The Presto cluster can then be scaled by adding or removing processing nodes to meet the requirements of the Redash users. Users are able to run queries against the data sources easily, as both Presto and Redash use a SQL interface. Integrating them offers other advantages such as data federation, fast query processing, and the ability to run different clusters optimized for the needs of business analysts and data scientists.

image 6

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Redash in 30 minutes!

Get Started with Presto & Redash

Turbocharge Zeppelin with the lightning-fast Presto SQL query engine

What is Zeppelin?

zeppelin classic logo

Zeppelin is an open-source tool for producing data science notebooks that are then stored as .json files. 

This kind of analytics software is called a notebook. A Zeppelin notebook (a note) is made up of paragraphs, each with two sections: one where you type your code, free text, or markdown-formatted annotations, and one underneath where the results appear as output once you run the code.

The outputs can be charts, query results, or a markdown preview. In Zeppelin, you can organize these paragraphs, with their code and result sections, into dashboards by stacking them or placing them next to each other.

With Zeppelin, you not only create data pipelines, but you can also visualize them as graphs to explain your ETL setup to others. If you want to make Zeppelin usable for your organization (like the marketing or sales teams, for example), you can build dynamic forms such as query templates. The users would only fill in the form and do not need to worry about the syntax and the correct structure of their queries.

Zeppelin works with a wide range of programming languages that are used in data analytics and data engineering. Among them are SQL, Python, Scala, shell, and markdown. This makes Zeppelin a popular tool for data scientists and engineers working with Apache data processing tools and frameworks. 

Zeppelin offers multi-user management for collaborative work on big projects. Companies generate large amounts of structured and unstructured data that often ends up in different databases and data lakes. With Zeppelin, you can pull data from all of them into one notebook using different interpreters; there is no need to create multiple notebooks.

For querying across Zeppelin and other data sources, many have turned to Presto, an open-source distributed query engine.

What Is Presto?

logo presto

Presto is an open source distributed SQL query engine that you can connect to many different data sources like relational databases, cloud storage, NoSQL databases, and many more. Many use Presto for their data lake analytics needs, as well as for federated querying across a range of data sources.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by increasing the number of worker nodes. This has made it an efficient choice for organizations with different data formats and sources and/or a large amount of data to process.

How Presto Works With Zeppelin

With Presto, you can query structured and unstructured data at once inside one Zeppelin note.

To integrate Presto with Zeppelin and across other data sources, you will need to add Presto as a new interpreter in Zeppelin. This can be done without any code, as Zeppelin has an interface feature for it: you add a JDBC interpreter configured for Presto (using the Presto JDBC driver and a URL of the form jdbc:presto://host:port/catalog). Then simply start writing a SQL query in a new paragraph, preceding it with the %jdbc(presto) directive to tell Zeppelin which interpreter to use.

Since Presto uses SQL, you won’t have to invest time into learning another niche programming language. Thus, speed, scalability, and familiarity make Presto a popular tool for real-time analytics on Zeppelin.

Turbocharge Trifacta with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Trifacta.

Interactive, ad hoc queries and faster data exploration for Trifacta

trifacta

What is Trifacta? Trifacta is a cloud-based data engineering front-end tool used by data analysts and data scientists to leverage an organization’s data. It makes data collection and consolidation, data preparation, data transformation, and the creation of data pipelines easy to perform. These data pipelines can then be used for different use cases like business intelligence (BI), machine learning, and data analytics.

Trifacta is a cloud-native solution meant to make it easy to use data in different formats and structures for tasks like data analysis and machine learning. This is achieved by simplifying the tasks of data cleaning, data exploration, data preparation, data validation, and creating data pipelines since the user can use a visual interface. It is supported on different cloud platforms including AWS, Google Cloud, and Microsoft Azure.

trifacta screen
image from trifacta.com

Trifacta is used to create data engineering pipelines for data exploration in a cloud environment. One way Trifacta is used is via direct connections to individual supported data stores. The other way is to connect Trifacta to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads and instant, seamless access to multiple data sources.

What Is Presto?

logo presto

Presto is an in-memory distributed query engine with a connector-based architecture for disparate data sources like S3 cloud storage, relational databases, and NoSQL databases. It was developed by Facebook to provide lightning-fast query responses over its HDFS data lake / warehouse. Since then, it has been adopted by hundreds of other companies including Uber, Twitter, Amazon, and Alibaba.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel manner. The computational power of a cluster can thus be increased by increasing the number of worker nodes. This has made it an efficient choice for organizations with different data formats and sources and/or a large amount of data to process.

Faster Queries and Unified Access to more Data using Trifacta and PrestoDB

trifacta

+

logo presto

As organizations become more data driven, both end-user computing and data access requirements are increasing. The solutions therefore must have high performance and be scalable to meet the demands placed on today’s data platform teams, who must respond quickly. This calls for the adoption of an open, flexible, distributed architecture. Combining Trifacta and Presto enables organizations to create a highly scalable, distributed, and modern data engineering platform.

A typical architecture consists of Trifacta connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries, while Trifacta handles the visualization, reporting, and data wrangling tasks. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Trifacta users. Integrating them offers other benefits such as data federation, fast query processing, and the ability to run different clusters optimized for the needs of data engineering workloads.

image 1

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Trifacta in only 30 minutes!

Getting Started is Easy and Free

Turbocharge Tableau with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Tableau.

Interactive, ad hoc queries and faster data visualizations for Tableau

Tableau Logo for website

What is Tableau? Tableau is a data visualization tool that can query a wide range of data sources like relational databases, cloud databases, spreadsheets, data warehouses, and much more.

Tableau allows people to see and understand their data and provides business intelligence. Users can combine data from various data sources for analysis in dashboards. You can also manage Tableau in your own environment or with a managed service. It offers an interface suitable for both visualization beginners and experts.

tableau dashboard
image from tableau.com

Tableau is capable of creating visualizations with sophisticated data analysis. Apart from building charts, Tableau helps to uncover deeply hidden relationships in the data. It also allows you to do calculations right inside the visualization layer, which makes it easier to adjust on the fly.

Consequently, Tableau is used regularly by academia and business professionals in different fields. Data scientists use Tableau to get insights on data spread across their business. Engineers use Tableau to quickly plot data and automate reporting for other business units. Business analysts create self-service reports for top managers.

One way Tableau is used is via direct connections to individual supported data stores. The other way is to connect Tableau to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads and instant, seamless access to multiple data sources. With Tableau + Presto, a distributed SQL query engine, it can become an even more powerful tool.

What is Presto?

logo presto

Great analytics requires great amounts of data and a speedy query engine. This is where Presto comes into play. 

It is called a distributed query engine since it uses a cluster of scale-out worker instances. It is called a federated engine because it allows you to fetch data from multiple sources inside the same query. Furthermore, Presto works with both relational and non-relational databases. Even petabytes of data won’t slow down its performance and prevent it from delivering results within seconds.

Presto is open source and hosted under the Linux Foundation (as the Presto Foundation). Anyone can download it from the GitHub repository and use it.

A few things make Presto stand out. More often than not, to query your raw data, you would need to move it to a new warehouse first. Presto works in-memory, querying the data directly where you store it. It creates an abstraction layer to transform data on the fly and dispatch it to you.

Presto queries data in Hadoop (HDFS) and S3-based cloud data lakes, and supports a wide variety of other data sources. Presto’s architecture is close to that of a massively parallel processing (MPP) database management system, but is more advanced.

Presto has multiple worker nodes that are orchestrated by a coordinator node. The worker nodes fetch your data across sources, and the coordinator plans the execution of your query and delivers the results.

Faster Queries and Unified Access to more Data using Tableau and PrestoDB

Tableau Logo for website

+

logo presto

Traditionally, Business Intelligence tools, such as Tableau, are optimized for structured data queries against data warehouses. These BI tools were unable to support big data sources efficiently. It made for a slow (and sometimes expensive) process with a lot of manual workflows.

It’s also a manual process to connect Tableau to your various data sources, and you have to manage each one. If you connect Tableau to Presto, then you can run your queries with Presto which will query all of your data sources with one query, without the need to manage all of those connections. The performance is much better as well, especially when you have many users. Combining Tableau and Presto enables organizations to create a highly scalable, distributed, and modern data engineering platform.

You can connect to Presto from the Connectors menu directly or by using an SQL statement. Once connected, you will be able to access live Presto data within Tableau. You can also publish completed workbooks with Presto data.

A typical architecture consists of Tableau connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries, while Tableau handles the visualization of reports and dashboards. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Tableau users. Integrating them offers other benefits such as data federation, fast query processing, and the ability to run different clusters optimized for the needs of business analysts and data scientists.

image 2

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Tableau in 30 minutes!

Getting Started is Easy and Free

Presto and AWS S3

What is AWS S3?

s3

Amazon Simple Storage Service (Amazon S3) is storage for the internet. Amazon S3 is used to store and retrieve any amount of data at any time, from anywhere on the web, by using the web interface of the AWS Management Console.

What is Presto?

logo presto

PrestoDB is a federated SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. Many organizations are adopting Presto as a single engine to query against all available data sources. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. 

Why use Presto with AWS S3?

Analysts get better performance at a lower cost by using S3 with Presto, as users can scale their workloads quickly and automatically. Presto allows users to quickly query both unstructured and structured data. Presto is an ideal workload in the cloud because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes, without needing to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.

Presto executes queries over data sets that are provided by plugins known as Connectors. Integrating Presto with S3 provides users with several features:

  • Presto, running on Amazon EMR, allows developers and analysts to easily run interactive SQL queries that directly access data stored in Amazon S3 for data-exploration, ad-hoc analysis and reporting.
  • The Hive connector allows Presto to query data stored in S3 and other S3-compatible object storage that is registered in a Hive Metastore (HMS).
  • Data transfer between a Presto cluster and S3 is fully parallelized.
  • Presto can be easily deployed using the AWS Serverless platform, with no servers, virtual machines, or clusters to set up, manage or tune. 

Since Presto is based on ANSI SQL, it’s very straightforward to start using it. The Presto connector architecture enables the federated access of almost any data source, whether a database, data lake or other data system. Presto can start from one node and scale to thousands. With Presto, users can use SQL to run ad hoc queries whenever you want, wherever your data resides. Presto allows users to query data where it’s stored so they don’t have to ETL data into a separate system. With an Amazon S3 connector, platform teams can simply point to their data on Amazon S3, define the schema, and start querying using the built-in query editor, or with their existing Business Intelligence (BI) tools. With Presto and S3, you can mine the treasures in your data quickly, and use your data to acquire new insights for your business and customers.
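
As a rough sketch of pointing Presto at data in S3, defining the schema, and querying it through the Hive connector, the snippet below registers a table over JSON files and runs an aggregate; the connection details, catalog, schema, bucket path, and columns are hypothetical, and a Hive Metastore is assumed to be attached to the cluster:

from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080, username="analyst",
                      catalog="hive", schema="logs")
cursor = conn.cursor()

# Register an external table over JSON files already sitting in S3
cursor.execute("""
    CREATE TABLE hive.logs.requests (
        request_id varchar,
        status integer,
        ts timestamp
    )
    WITH (format = 'JSON', external_location = 's3a://my-example-bucket/requests/')
""")
cursor.fetchall()

# Query the S3-backed table like any other table
cursor.execute("SELECT status, count(*) AS hits FROM hive.logs.requests GROUP BY status")
print(cursor.fetchall())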

For users who are ready to use Presto to query their AWS S3 but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and has out-of-the-box integrations with AWS S3, in addition to AWS Glue and Hive Metastore (HMS).

AhanaCloudArchitecture

You can try Ahana Cloud out free for 14 days, sign up and get started today.

Presto and Amazon Redshift

What is Amazon Redshift?

redshift

Amazon Redshift is a cloud data warehouse application service used by data analysts or data warehouse engineers for analyzing data using standard SQL and your existing Business Intelligence (BI) tools. Users can start with just a few hundred gigabytes of data and scale to a petabyte or more.

What is Presto?

logo presto

Presto is a federated SQL query engine for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data, which continues to grow exponentially across a wide range of data lakes and databases. Many organizations are adopting Presto as a single engine to query against all available data sources. Data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in-place. This means that Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. 

Why Presto and Amazon Redshift?

Analysts get better performance at a lower cost by using the Presto Redshift stack, as users can scale their workloads quickly and automatically. Presto allows users to quickly query both unstructured and structured data. Presto is an ideal workload in the cloud because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes, without needing to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.

Presto executes queries over data sets that are provided by plugins known as Connectors. Integrating Presto with Redshift provides users with new capabilities:

  • Presto reads data directly from HDFS, so you don’t need to perform ETL on the data. Presto has also been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Redshift.
  • The Redshift connector allows users to query and create tables in an external Amazon Redshift cluster. Users can join data between different systems like Redshift and Hive, or between two different Redshift clusters. Since Presto on Amazon EMR supports spot instances, the total cost of running a data platform is lower.
  • Presto can reduce query execution time. Presto provides the ability to run queries in seconds instead of hours, and analysts can iterate quickly on innovative hypotheses with the interactive exploration of any dataset, residing anywhere.

Since Presto is based on ANSI SQL, it’s very straightforward to start using it. The Presto connector architecture enables the federated access of almost any data source, whether a database, data lake, or other data system. Presto can start from one node and scale to thousands. With Presto, users can use SQL to run ad hoc queries whenever you want, wherever your data resides. Presto allows users to query data where it’s stored so you don’t have to ETL data into a separate system. With a Redshift connector, platform teams can quickly provide access to datasets that analysts are interested in while being able to scale and meet the growing requirements for analytics. With Presto and Redshift, you can mine the treasures in your data quickly, and use your data to acquire new insights for your business and customers.
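
For example, a federated query that joins a table in Redshift with clickstream data in an S3 data lake might look like the minimal sketch below; the catalog, schema, table, and column names are hypothetical and depend on how your Redshift and Hive catalogs are configured:

from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080, username="analyst")
cursor = conn.cursor()

# Join a Redshift table with a Hive/S3 table in a single Presto query
cursor.execute("""
    SELECT o.order_id, o.order_total, c.campaign
    FROM redshift.public.orders AS o
    JOIN hive.web.clicks AS c
      ON o.order_id = c.order_id
    WHERE c.click_date >= date '2021-01-01'
""")
print(cursor.fetchmany(10))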

For more information about Ahana, the easiest managed service for Presto on AWS, visit https://ahana.io.

Turbocharge Qlik with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is a SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Qlik.

Faster data visualizations & interactive, ad hoc queries for Qlik

qlik vector logo

What is Qlik? Qlik is a platform providing data visualization and BI for end-to-end analysis of data. It offers interactive and user-customizable data visualizations, data federations, data analytics, and business intelligence (BI). Qlik enables its users to use data from different sources to carry out interactive data analysis and visualization. Organizations can leverage the available data for day-to-day decision-making and daily operational processes, promoting a data-driven culture. Some of the companies using Qlik include Samsung, Toyota, HSBC, Merck, and Shell.

Qlik comes in two versions, a Windows application and a cloud offering, and it can be deployed on multiple cloud environments. It builds associations between data that may reside in different data sources and carries out analytics in-memory for improved performance. This provides fast analytics and a unified view of an organization’s data.

The Architecture of a Qlik App

A Qlik app consists of data source, back-end, and front-end components. The data source component is responsible for handling data access between the app and the various data sources used. The back-end component contains the Publisher, which is responsible for pulling data from the different sources and providing the view files to a Qlik Server. It is mainly used by software developers, system analysts, database administrators, and other technical personnel.

qlik water app x large
image from qlik.com

The front-end accesses the views provided by the server and presents them as an interactive web app. It leverages the associative model used by Qlik to make selections, the main way of carrying out data analysis. The main users are organizational employees in different departments, including marketing, sales, finance, management, and logistics.

What Is Presto?

logo presto

Presto is a SQL query engine originally developed by Facebook to replace Hive, enabling fast, interactive queries against the social media platform’s huge amounts of data. It is meant to connect to different data sources and to perform queries against them in a parallel and distributed manner. Presto is an open source project, housed by The Linux Foundation’s Presto Foundation. The connector-based architecture allows disparate data sources like S3 cloud storage, relational databases, and NoSQL databases to be queried. Many hundreds of other companies have since adopted Presto, including Uber, Twitter, Amazon, and Alibaba.

A Presto cluster consists of a single coordinator and several worker nodes. The worker nodes are responsible for connecting to various sources and transparently carrying out query processing in a distributed and parallel approach. The computational power of a cluster can thus be increased by increasing the number of worker nodes. This has made it a lightning-fast choice for organizations with different data formats and sources and/or a large amount of data to process.

Faster Queries and Unified Access to more Data using Qlik and Presto

qlik vector logo

+

logo presto

By using Presto, an organization can create clusters running Qlik applications targeted at different use cases. These use cases include functions such as data aggregation, data analysis, ad-hoc querying, and data visualization. It also helps an organization develop a unified view of the data. Qlik apps only have to interface with one data service, Presto, while being able to query multiple data sources. This is because Presto abstracts the different data sources being used by the apps.

The common approach is to deploy different Presto clusters for different use cases. The Qlik apps then connect to the clusters, relying on them to perform the actual data querying. Being an in-memory distributed query engine, PrestoDB can process large amounts of data in very short periods of time. The use of a cluster architecture means that horizontal scaling can be carried out easily and efficiently.

Combining Qlik and Presto enables organizations to create a highly scalable, distributed, and modern data engineering platform.

A typical architecture consists of Qlik connected to a Presto cluster which, in turn, is connected to one or more data sources. Presto handles the data access and in-memory processing of queries, while Qlik handles the visualization, reporting, and dashboarding. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the workloads of the Qlik users. Integrating them offers other benefits such as data federation, fast query processing, and the ability to run different clusters optimized for the needs of business analysts and data scientists.

image 3

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Qlik in 30 minutes!

Getting Started is Easy and Free

Presto and MySQL

What is MySQL?

mysql

MySQL is a popular open-source relational database management system (RDBMS) that data engineers use to organize data into tables. MySQL supports foreign keys to create relationships between tables. You can create, modify and extract data from the relational database using SQL commands.

MySQL is easy and intuitive to use; you only have to know a few SQL statements to use it, like SHOW DATABASES, CREATE DATABASE, USE, CREATE TABLE, SHOW TABLES, DESCRIBE, INSERT INTO … VALUES,  SELECT, WHERE, DELETE, ALTER TABLE…ADD, ALTER TABLE…DROP, and SHOW CREATE TABLE. 

MySQL has a client-server architecture, which means that many clients can communicate with the MySQL server to run queries, create tables, and so on. You can scale with ease; MySQL is fully multithreaded and handles large data volumes. It runs on more than 20 platforms, including Linux, Windows, and UNIX. You can commit transactions or undo them with the ROLLBACK statement. It is memory efficient, and you can partition your data, which improves query performance.

MySQL Use Cases

Data engineers don’t typically use MySQL to run queries against massive data volumes. Instead, more common use cases might include: 

•      Small web-based databases: an eCommerce store, a web-page database, and an app database (with a small data volume)

•      Online analytical processing and online transaction processing systems (OLAP/OLTP) (with small data volume)

•      Building a business intelligence tool that doesn’t have a massive data volume

If you want to query big data with your MySQL database, you can use PrestoDB.

What is PrestoDB?

logo presto

PrestoDB is an open source SQL query engine for querying data lakes and other data sources without having to move your data. Presto runs queries directly against the systems where the data lives, including databases like MySQL, and it can join data from multiple sources like MySQL, Hadoop, and Cassandra in a single query.

Using MySQL and Presto together

Data engineers commonly use MySQL and PrestoDB for joining different systems like Hive, Cassandra, and MySQL and running queries across those data sources; it’s common to use partitioning to run queries efficiently.   

You can use MySQL and PrestoDB together to run queries against a massive data volume. PrestoDB queries the data in MySQL directly through its MySQL connector, without copying it into another system.

PrestoDB has a MySQL connector to run queries and create tables in your MySQL database. You also can join different MySQL databases or different systems like MySQL and Hive. 

PrestoDB maps the databases and tables in MySQL to its own schemas and tables. (Separately, the Hive metastore service that Presto uses for data lake tables often keeps its metadata catalog in a MySQL database.)

You create a catalog properties file named “catalog_name.properties” in “etc/catalog” to mount the MySQL connector. This file must contain the connector.name, connection-url, connection-user, and connection-password properties. You can have as many catalog files as you want.
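
A minimal sketch of such a file, for example etc/catalog/mysql.properties, might look like this (the host, user, and password are placeholders):

connector.name=mysql
connection-url=jdbc:mysql://mysql.example.com:3306
connection-user=presto_user
connection-password=secret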

In PrestoDB, schemas contain database objects like tables, and the MySQL connector exposes one schema for each MySQL database. To create, modify, and extract data in your MySQL database through PrestoDB, you can use statements such as the following:

•      USE catalog_name.dbname: selects the catalog and database (schema) for the session

•      CREATE TABLE catalog_name.dbname.table_name (...): creates a new table in that MySQL database

•      INSERT INTO catalog_name.dbname.table_name VALUES (...): adds rows to a table

•      SHOW SCHEMAS FROM catalog_name: shows all databases exposed by that catalog

•      SHOW TABLES FROM catalog_name.dbname: shows all tables in the database (dbname)

•      DESCRIBE catalog_name.dbname.table_name: shows the columns of the table (table_name) in the database (dbname)

•      SHOW COLUMNS FROM catalog_name.dbname.table_name: equivalent to DESCRIBE; shows the columns of the table

•      SELECT * FROM catalog_name.dbname.table_name: reads the table (table_name) in the database (dbname)

Statements such as CREATE DATABASE and LOAD DATA ... INTO TABLE are MySQL statements that you run directly against MySQL rather than through Presto.

If you’re ready to use Presto to query MySQL but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and can integrate with MySQL.

AhanaCloud Architecture 2

Get started with a 14-day free trial of Ahana Cloud.

Presto and PostgreSQL

What is PostgreSQL?

postgresql logo

PostgreSQL is an enterprise-grade open source database management system (DBMS) that is free and offers advanced functionality common to more expensive DBMSs. It is an object-relational database that supports multiple programming languages, custom data types, JSON, relational modeling, and table partitioning. Some of the companies using PostgreSQL in production include Instagram, Apple, Cisco, and Red Hat.

PostgreSQL’s wide usage and support are the result of several factors. Being open source, the database is free and benefits from contributions from a wide developer community. Because it is so widely used, there are many talented, experienced people who are already conversant with it, so there is no shortage of skilled personnel to implement and maintain data management systems powered by PostgreSQL. Features such as table partitioning, replication, and geospatial data support put it on par with closed source enterprise DBMSs that are expensive to acquire and run.

PostgreSQL is highly extensible and scalable and is used to manage massive datasets reliably. Support for foreign data wrappers ensures that it integrates easily with data streams and other databases. This makes it ideal as the data storage system for online analytical processing (OLAP), advanced analytics, and BI solutions. Installation is simple since it is free to download. Additionally, free Docker containers and paid cloud solutions exist to support a cloud-based microservices architecture for even better horizontal scalability of the developed systems.

What Is Presto?

logo presto

Presto is a query engine used to query different data stores in differing formats in a distributed and parallel manner. It was originally developed by Facebook to manage its data warehouse developed using Apache Hive. 

Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Leveraging PostgreSQL and Presto

Combining PostgreSQL and PrestoDB offers many benefits to developers and organizations. The advanced database features like geospatial data support, full text search, and support for different programming languages combined with being open source make it highly customizable. Developers are therefore free to configure it as they deem best for their systems. Using Presto enables better solutions as it is also highly customizable, open source, and distributed. Worker nodes can integrate easily with multiple nodes running the database as Presto ships with a free PostgreSQL plugin.

PrestoDB works by using the plugin to connect to PostgreSQL. The plugin is responsible for generating SQL statements that are compatible with the database. Worker nodes use the plugin to communicate with the DBMS to issue commands and fetch data. By pushing down as much work as possible to the database, Presto is able to leverage the power and features of PostgreSQL. This makes them a great combination for developing big data solutions. Some of the use cases of Presto and PostgreSQL are machine learning, data analysis, data visualization, and business intelligence.

If you’re ready to use Presto to query PostgreSQL but don’t want to worry about the complexities or overhead of managing Presto, you can use a managed service like Ahana Cloud. Ahana Cloud gives you the power of Presto without having to get under the hood. It’s a managed service for AWS and can integrate with PostgreSQL.

AhanaCloud Architecture 2

Get started with a 14-day free trial of Ahana Cloud.

Turbocharge Looker with the lightning-fast Presto SQL query engine

Ahana Cloud for Presto is the SaaS Managed Service which speeds up the performance of leading business intelligence, data visualization, and SQL tools like Looker.

Interactive, ad hoc queries and faster data visualizations for Looker

Looker logo

What is Looker? Looker is a business intelligence (BI) platform that helps individuals, businesses, and organizations get better insights from available data. Acquired by Google in 2019 and part of GCP (Google Cloud Platform), it helps users explore data, find meaningful trends and patterns, and visualize the results. It is available on GCP and AWS as a cloud solution and can also be used to build custom solutions as needed. Looker can connect to different databases including BigQuery, MySQL, PostgreSQL, Oracle, PrestoDB, and Vertica, and supports more than 50 different SQL dialects.

Looker users include software developers, data scientists, marketers, and management for different use cases. These include business intelligence, data analytics, and supporting data-driven decision-making from different data sources. Its power lies in being able to interact with different data sources to create a holistic view of the available data to different users and departments. This makes it easy to manage workflows in the whole project or organization in a data-driven manner.

Since the users do not have to learn SQL, everyone in a project or organization can customize their reports, dashboards, heatmaps, and other BI presentation assets easily. Furthermore, Looker goes beyond rudimentary BI to include support for fresh data, machine learning, performance monitoring, and ETL optimizations.

How Looker Works Under the Hood

Looker is a cloud-based platform that can be deployed to offer advanced BI. It works by using LookML to generate SQL code that is sent over a connection to a given database. This allows Looker to connect to different databases for business intelligence purposes.

A Looker project consists of one or more models that provide a connection to a database. A model can contain one or more Explores that provide an interactive webpage where a user can dynamically work with the data. 

looker screenshot
image from looker.com

Explores contain views that map to the tables in the given database. Users are then able to issue queries to the database, filter their results, and visualize them in a variety of ways.

While Looker can use direct connections to individual supported data stores, another way is to connect Looker to a distributed query engine like Presto, to enable higher performance, higher concurrent workloads, and instant, seamless access to multiple data sources. With Looker + Presto, Looker can become an even more powerful tool.

What is Presto?

logo presto

Presto is a distributed query engine that allows in-memory processing of data from disparate data sources. It was developed by Facebook and is in use at other large companies including Uber, Twitter, Alibaba, and Amazon as a de facto standard for their SQL workloads.

Faster Result Sets and Unified Access to more Data using Looker and PrestoDB

Looker logo

+

logo presto

Looker is an extremely powerful front-end BI platform. Combining it with a Presto-based back end cuts query times to seconds rather than minutes. To create a Looker and Presto system, you first deploy your Presto cluster. You then create a connection in Looker, selecting PrestoDB as the dialect.

Looker then relies on Presto to carry out query processing. The queries run in a distributed manner and provide access to multiple data sources. This also provides a single source of truth with regard to the overall data model. This kind of configuration is used to provide highly scalable enterprise-level BI solutions by using Looker for BI and Presto for distributed data querying.

A typical architecture consists of Looker connected to a Presto cluster with one or more connected data sources. Presto handles the data access and in-memory processing of queries, while Looker handles the visualization of reports and dashboards. This allows the Presto cluster to be scaled by adding or removing processing nodes to meet the requirements of the Looker users. Integrating them offers other benefits such as data federation, fast query processing, and the ability to run different clusters optimized for the needs of business analysts and data scientists.

image 5

Ahana Cloud is the cloud-native SaaS managed service for Presto. See how you can turbocharge Looker in 30 minutes!

Ready to Turbocharge your Looker with Presto?

Turbocharge Jupyter with the lightning-fast Presto SQL query engine

What is Jupyter?

Jupyter logo

Jupyter Notebook is open-source software for creating shareable documents and reports. Each such document can easily host a whole working data pipeline. The .ipynb files can contain code chunks, free and markdown-formatted text, and even visual objects for data analysis.

Jupyter became particularly popular among data scientists and engineers using Python because it works with most Python packages. A Jupyter Notebook can quickly digest and process significant amounts of data, and you just need a couple of command lines to install it using a command shell. 

While Jupyter has no drag-and-drop elements, it still offers a few simple no-code features. The software is called a “notebook” due to its interface. In a cell, you can type in a chunk of code, run it, and see the results printed to a small console underneath. Then you add another chunk and repeat the operation, getting the results appended under the second chunk. Since you have your code and all outputs in one place, ordered chronologically, it is very similar to taking notes on a paper notepad.

jupyter
image from jupyter.org

Jupyter works well for data analysis or for drafting and testing data pipelines. If you need some data insights and a report based on them, you can add your comments and annotations directly into the notebook, run the kernel, and export the whole document as an HTML file to be sent over. You can also build a data pipeline, load and process your data, and export it to a database, to an Excel file, or through one of the many other methods available. Once you are satisfied that it runs smoothly and delivers adequate results, you can deploy the pipeline in your main system.

Jupyter has a fairly steep learning curve. Fortunately, its contributors have put together thorough documentation to help new users and enable troubleshooting.

What is Presto?

logo presto

Presto is an open source SQL query engine used to query data lakes and other data stores in differing formats in a distributed and parallel manner. It was originally developed by Facebook to manage its data warehouse developed using Apache Hive. 

The Presto architecture consists of a coordinator node and multiple worker nodes. The coordinator node is responsible for accepting queries and returning results. The worker nodes do the actual computation and connect to the data stores. This distributed architecture makes Presto fast and scalable.

Faster Queries and Unified Access to more Data using Jupyter and Presto

You can use Presto with Jupyter Notebook to expand your data pools.

To connect to your Presto engine, you can use the Presto connector from the pyhive package. All Python-based connectors use similar logic. In the case of Presto, you need to enter your host and port, and then add your catalog and schema. After the connection has been established, you can run SQL queries to fetch the data.
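
Here is a minimal sketch of that connection; the host, port, username, catalog, schema, and the query itself are hypothetical placeholders:

from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080, username="jupyter",
                      catalog="hive", schema="default")
cursor = conn.cursor()

# Run a SQL query through Presto and pull the results into the notebook
cursor.execute("SELECT region, count(*) AS order_count FROM orders GROUP BY region")
for region, order_count in cursor.fetchall():
    print(region, order_count)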

With Presto, you can query both relational and non-relational databases and other data sources. Moreover, you can combine data from more than one data source within a single query. Presto is also performant: calculations on terabytes of data take minutes. Presto sends one query to multiple worker nodes using a coordinator node, distributing the workload. The coordinator collects the data once it’s ready and dispatches it to you.