
Announcing the Cube integration with Ahana: Querying multiple data sources with managed Presto and Cube

See how Ahana and Cube work together to help you set up a Presto cluster and build a single source of truth for metrics without spending days reading cryptic docs

Ahana provides managed Presto clusters running in your AWS account.

Presto is an open-source distributed SQL query engine, originally developed at Facebook, now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.

Presto implements data federation: you can process data from multiple sources as if they were stored in a single database. Because of that, you don’t need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single point of access for multiple databases (or file systems) requires Ops skills and additional effort.

Of course, no data engineer wants to spend their time on Ops work. Using Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana takes on the burden of infrastructure management and allows you to focus on processing your data.

What is Cube?

Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, or query engines, including Presto, and allows you to quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.

This article will demonstrate the caching functionality, access control, and flexibility of the data retrieval API.

Integration

Cube’s battle-tested Presto driver provides out-of-the-box connectivity to Ahana.

You just need to provide the credentials: the Presto host name and port, user name and password, and the Presto catalog and schema. You’ll also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.

Check the docs to learn more about connecting Cube to Ahana.
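As a quick sketch, those settings end up as Cube environment variables along these lines (a minimal example; the values are placeholders and the exact variable names are worth double-checking against the Cube docs for Presto):

```
CUBEJS_DB_TYPE=prestodb
CUBEJS_DB_HOST=<your-ahana-presto-endpoint>
CUBEJS_DB_PORT=443
CUBEJS_DB_USER=<presto-user>
CUBEJS_DB_PASS=<presto-password>
CUBEJS_DB_PRESTO_CATALOG=<catalog>
CUBEJS_DB_SCHEMA=<schema>
CUBEJS_DB_SSL=true
```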

Example: Parsing logs from multiple data sources with Ahana and Cube

Let’s build a real-world data application with Ahana and Cube.

We will use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with the data retrieved from a PostgreSQL database.

Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you would like to demonstrate the improvements to the clients.

Because of that, you do a canary deployment of the new versions and gather predictions from both the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. You also track the actual production values provided by your clients.

You need all of that to prepare personalized dashboards showing the results of your hard work.

Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.

You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you will calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.

Architecture diagram

In the end, you want a dashboard looking like this:

The final result: two dashboards showing the number of errors made by two variants of the ML model


How to configure Ahana?

Allowing Ahana to access your AWS account

First, let’s log in to Ahana and connect it to our AWS account. We must create an IAM role that allows Ahana to access the account.

On the setup page, click the “Open CloudFormation” button. It redirects us to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.

When the IAM role is configured, click the stack’s Outputs tab and copy the AhanaCloudProvisioningRole key value.

The Outputs tab containing the identifier of the IAM role for Ahana

We have to paste it into the Role ARN field on the Ahana setup page and click the “Complete Setup” button.

The Ahana setup page

Creating an Ahana cluster

After configuring AWS access, we have to start a new Ahana cluster.

In the Ahana dashboard, click the “Create new cluster” button.

Ahana create new cluster

In the setup window, we can configure the type of the AWS EC2 instances used by the cluster, scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, look at the “Create new cluster” section of the Ahana documentation.

Ahana cluster setup page

Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the “Create cluster” button. Ahana needs around 20-30 minutes to set up a new cluster.

Retrieving data from S3 and PostgreSQL

After deploying a Presto cluster, we have to connect our data sources to it. In this example, the Sagemaker Endpoint logs are stored in S3, while the actual values live in PostgreSQL.

Adding a PostgreSQL database to Ahana

In the Ahana dashboard, click the “Add new data source” button. We will see a page showing all supported data sources. Let’s click the “Amazon RDS for PostgreSQL” option.

In the setup form displayed below, we have to provide the database configuration and click the “Add data source” button.

PostgreSQL data source configuration

Adding an S3 bucket to Ahana

AWS Sagemaker Endpoints store their logs in an S3 bucket as JSON files. To access those files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.

We have to log in to the AWS console, open the AWS Glue page, and add a new database to the data catalog (or use an existing one).

AWS Glue databases

Now, let’s add a new table. We won’t configure it manually. Instead, let’s create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the “Crawlers” link and click the “Add crawler” button.

AWS Glue crawlers

After typing the crawler’s name and clicking the “Next” button, we will see the Source Type page. On this page, we have to choose “Data stores” and “Crawl all folders” (in our case, “Crawl new folders only” would work too).

Here we specify where the crawler should look for new data

On the “Data store” page, we pick the S3 data store, select the S3 connection (or click the “Add connection” button if we don’t have an S3 connection configured yet), and specify the S3 path.

Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use those parts of the key as table partitions.

Because of that, if our Sagemaker logs have an S3 key: s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!
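To make the partition layout concrete, here is a small sketch (the file name and values are illustrative) of how the trailing parts of a log key, relative to that prefix, map onto the endpoint-name/model-variant/year/month/day/hour partitions:

```javascript
// Illustrative only: split the part of a Sagemaker log key below the crawler
// root (s3://the_bucket_name/sagemaker/logs) into its partition values
const key = 'endpoint-name/model-variant-name/2022/05/06/10/data.jsonl';
const [endpointName, modelVariant, year, month, day, hour] = key.split('/');
console.log(endpointName, modelVariant, year, month, day, hour);
// endpoint-name model-variant-name 2022 05 06 10
```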

IAM role configuration

Let’s click the “Next” button. In the subsequent window, we choose “No” when asked whether we want to configure another data source. Glue setup will ask about the name of the crawler’s IAM role. We can create a new one:

Next, we configure the crawler’s schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:

configuring the crawler's schedule

In the output configuration, we need to customize the settings.

First, let’s select the Glue database where the new tables get stored. After that, we modify the “Configuration options.”

We pick “Add new columns only” because we will make manual changes to the table definition, and we don’t want the crawler to overwrite them. We also want to add new partitions to the table, so we check the “Update all new and existing partitions with metadata from the table” box.

Crawler's output configuration

Let’s click “Next.” We can check the configuration one more time in the review window and click the “Finish” button.

Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.

AWS Glue tables

In the table view, we click the “Edit table” button and change the “Serde serialization lib” to “org.apache.hive.hcatalog.data.JsonSerDe” because the AWS JSON serialization library isn’t available in the Ahana Presto cluster.

JSON serialization configured in the table details view

We should also click the “Edit schema” button and change the default partition names to values shown in the screenshot below:

Default partition names replaced with their actual names

After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.

Configuring data sources in the Presto cluster

Go back to the Ahana dashboard and click the “Add data source” button. Select the “AWS Glue Data Catalog for Amazon S3” option in the setup form.

AWS Glue data catalog setup in Ahana

Let’s select our AWS region and put the AWS account id in the “Glue Data Catalog ID” field. After that, we click the “Open CloudFormation” button and apply the template. We will have to wait until CloudFormation creates the IAM role.

When the role is ready, we copy the role ARN from the Outputs tab and paste it into the “Glue/S3 Role ARN” field:

The "Outputs" tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana

On the Ahana setup page, we click the “Add data source” button.

Adding data sources to an existing cluster

Finally, we can add both data sources to our Ahana cluster.

We have to open the Ahana “Clusters” page, click the “Manage” button, and scroll down to the “Data Sources” section. In this section, we click the “Manage data sources” button.

We will see another setup page where we check the boxes next to the data sources we want to configure and click the “Modify cluster” button. We will need to confirm that we want to restart the cluster to make the changes.

Adding data sources to an Ahana cluster

Writing the Presto queries

The exact structure of the input and output of an AWS Sagemaker Endpoint is up to us. We can send any JSON request and return a custom JSON object.

Let’s assume that our endpoint receives a request containing the input data for the machine learning model and a correlation id. We will need those ids to join the model predictions with the actual data.

Example input:

{"time_series": [51, 37, …, 7], "correlation_id": "cf8b7b9a-6b8a-45fe-9814-11a4b17c710a"}

In the response, the model returns a JSON object with a single “prediction” key and a decimal value:

{"prediction": 21.266147618448954}

A single request in Sagemaker Endpoint logs looks like this:

{"captureData": {"endpointInput": {"observedContentType": "application/json", "mode": "INPUT", "data": "eyJ0aW1lX3NlcmllcyI6IFs1MS40MjM5MjAzODYxNTAzODUsIDM3LjUwOTk2ODc2MTYwNzM0LCAzNi41NTk4MzI2OTQ0NjAwNTYsIDY0LjAyMTU3MzEyNjYyNDg0LCA2MC4zMjkwMzU2MDgyMjIwODUsIDIyLjk1MDg0MjgxNDg4MzExLCA0NC45MjQxNTU5MTE1MTQyOCwgMzkuMDM1NzA4Mjg4ODc2ODA1LCAyMC44NzQ0Njk2OTM0MzAxMTUsIDQ3Ljc4MzY3MDQ3MjI2MDI1NSwgMzcuNTgxMDYzNzUyNjY5NTE1LCA1OC4xMTc2MzQ5NjE5NDM4OCwgMzYuODgwNzExNTAyNDIxMywgMzkuNzE1Mjg4NTM5NzY5ODksIDUxLjkxMDYxODYyNzg0ODYyLCA0OS40Mzk4MjQwMTQ0NDM2OCwgNDIuODM5OTA5MDIxMDkwMzksIDI3LjYwOTU0MTY5MDYyNzkzLCAzOS44MDczNzU1NDQwODYyOCwgMzUuMTA2OTQ4MzI5NjQwOF0sICJjb3JyZWxhdGlvbl9pZCI6ICJjZjhiN2I5YS02YjhhLTQ1ZmUtOTgxNC0xMWE0YjE3YzcxMGEifQ==", "encoding": "BASE64"}, "endpointOutput": {"observedContentType": "application/json", "mode": "OUTPUT", "data": "eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==", "encoding": "BASE64"}}, "eventMetadata": {"eventId": "b409a948-fbc7-4fa6-8544-c7e85d1b7e21", "inferenceTime": "2022-05-06T10:23:19Z"}
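We can sanity-check the encoding with a few lines of Node.js; decoding the endpointOutput.data field of the record above yields the response JSON:

```javascript
// Decode the base64-encoded endpointOutput payload from the log record above
// (the same transformation our Presto query performs with from_base64)
const data = 'eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==';
const decoded = JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
console.log(decoded.prediction); // 21.266147618448954
```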

AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:

with sagemaker as (
  select
  model_name,
  variant_name,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
  from s3.sagemaker_logs.logs
)
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)

After that, we join both data sources and calculate the absolute error value:

, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),

Now, we calculate the percentiles using the `approx_percentile` function. Note that we group the percentiles by model name and model variant, so Presto produces only a single row per model-variant pair. That’ll be important when we write the second part of this query.

percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)

In the final part of the query, we use the filter expression to count the number of values within each bucket. Additionally, we return the bucket boundaries. We need an aggregate function such as max because of the group by clause; it won’t affect the result because the previous query returned a single row per model-variant pair.

SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p join errors e on e.model_name = p.model_name and e.model_variant = p.model_variant group by p.model_name, p.model_variant

How to configure Cube?

In our application, we want to display the distribution of absolute prediction errors.

We will have a chart showing the difference between the actual value and the model’s prediction. Our chart will split the absolute errors into buckets (percentiles) and display the number of errors within every bucket.

If the new variant of the model performs better than the existing model, we should see fewer large errors in the charts. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the “0” label.

At the beginning of the article, we looked at an example chart that shows no significant difference between both model variants:

example chart_Both models perform almost the same

If variant B were better than variant A, its chart could look like this (note the axis values in both pictures):

An improved second version of the model_example chart

Creating a Cube deployment

Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready-to-use Cube cluster. However, if you prefer self-hosting, then follow this tutorial.

First, let’s create a new Cube Cloud deployment: open the “Deployments” page and click the “Create deployment” button.

Cube Deployments dashboard page

We choose the Presto cluster:

Database connections supported by Cube

Finally, we fill out the connection parameters and click the “Apply” button. Remember to enable the SSL connection!

Presto configuration page

Defining the data model in Cube

We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve query results.

Let’s open the Schema view in Cube and add a new file.

The schema view in Cube showing where we should click to create a new file

In the next window, type the file name errorpercentiles.js and click “Create file.”

In the following paragraphs, we will explain parts of the configuration and show you code fragments to copy-paste. You don’t have to do that in such small steps!

Below, you see the entire content of the file. Later, we explain the configuration parameters.

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});

cube('errorpercentiles', {
  sql: `with sagemaker as (
    select
    model_name,
    variant_name,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
    from s3.sagemaker_logs.logs
  )
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)
, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p join errors e on e.model_name = p.model_name and e.model_variant = p.model_variant group by p.model_name, p.model_variant`,

preAggregations: {
// Pre-Aggregations definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},

joins: {
},

measures: measures,
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
});

In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.

A newly created cube configuration file

We will group and filter the values by the model and variant names, so we put those columns in the dimensions section of the cube configuration. The rest of the columns become our measures. We can write them out one by one like this:


measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_20: {
    sql: `perc_20`,
    type: `max`
  },
  perc_30: {
    sql: `perc_30`,
    type: `max`
  },
  perc_40: {
    sql: `perc_40`,
    type: `max`
  },
  perc_50: {
    sql: `perc_50`,
    type: `max`
  },
  perc_60: {
    sql: `perc_60`,
    type: `max`
  },
  perc_70: {
    sql: `perc_70`,
    type: `max`
  },
  perc_80: {
    sql: `perc_80`,
    type: `max`
  },
  perc_90: {
    sql: `perc_90`,
    type: `max`
  },
  perc_100: {
    sql: `perc_100`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  perc_20_value: {
    sql: `perc_20_value`,
    type: `max`
  },
  perc_30_value: {
    sql: `perc_30_value`,
    type: `max`
  },
  perc_40_value: {
    sql: `perc_40_value`,
    type: `max`
  },
  perc_50_value: {
    sql: `perc_50_value`,
    type: `max`
  },
  perc_60_value: {
    sql: `perc_60_value`,
    type: `max`
  },
  perc_70_value: {
    sql: `perc_70_value`,
    type: `max`
  },
  perc_80_value: {
    sql: `perc_80_value`,
    type: `max`
  },
  perc_90_value: {
    sql: `perc_90_value`,
    type: `max`
  },
  perc_100_value: {
    sql: `perc_100_value`,
    type: `max`
  }
},
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
A part of the error percentiles configuration in Cube

The notation above is quite repetitive and verbose. We can shorten the measure definitions by using JavaScript to generate them.

Note that the following code must be placed before the call to the cube function!

First, we have to create an array of column names:


const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

Now, we must generate the measures configuration object. We iterate over the array and create a measure configuration for every column:


const measures = measureNames.reduce((result, sqlName) => ({
  ...result,
  [sqlName]: {
    sql: () => sqlName,
    type: `max`
  }
}), {});

Finally, we can replace the measure definitions with:

measures: measures

After changing the file content, click the “Save All” button.

The top section of the schema view

Then click the “Continue” button in the popup window.

The popup window shows the URL of the test API

In the Playground view, we can test our query by retrieving the chart data as a table (or one of the built-in charts):

An example result in the Playground view

Configuring access control in Cube

In the Schema view, open the cube.js file.

We will use the queryRewrite configuration option to allow or disallow access to data.

First, we will reject all API calls without the models field in the securityContext. We will put the identifiers of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token claims.

For example, we can send a JWT token with the following payload. Of course, in the application sending queries to Cube, we must check the user’s access right and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don’t forget about them.
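A hypothetical token payload could look like this (the model names are illustrative; your authentication service decides the real values):

```json
{
  "models": ["model_a", "model_b"]
}
```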

The Security Context window in the Playground view

After rejecting unauthorized access, we add a filter to all queries.

We can tell which dataset a user accesses by inspecting the query itself. We need to do this because we must filter by the modelName property of the correct cube.

In our queryRewrite configuration in the cube.js file, we use the query.filters.push function to add a modelName IN (model_1, model_2, ...) clause to the SQL query:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'errorpercentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};
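To illustrate what this rewrite does, here is a standalone sketch (the incoming query object and model names are made up) showing the filter that gets appended before a query reaches Presto:

```javascript
// Standalone sketch of the queryRewrite behavior; not Cube itself
const queryRewrite = (query, { securityContext }) => {
  if (!securityContext.models) {
    throw new Error('No models found in Security Context!');
  }
  query.filters.push({
    member: 'errorpercentiles.modelName',
    operator: 'in',
    values: securityContext.models,
  });
  return query;
};

// A made-up incoming query from a user whose token allows only model_a
const rewritten = queryRewrite(
  { measures: ['errorpercentiles.perc_50'], filters: [] },
  { securityContext: { models: ['model_a'] } }
);
console.log(rewritten.filters[0].values); // [ 'model_a' ]
```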

Configuring caching in Cube

By default, Cube caches all Presto queries for 2 minutes. Even though Sagemaker Endpoints store logs in S3 in near real-time, we aren’t interested in refreshing the data that often. Since the logs are JSON files, retrieving the metrics requires a full scan of all files in the S3 bucket.

When we gather logs over a long period, the query may take a while. Below, we show how to configure caching in Cube. We recommend doing it when the end-user application needs over one second to load the data.

For the sake of the example, we will refresh the data only twice a day.

Preparing data sources for caching

First, we must allow Presto to store data in both PostgreSQL and S3. It’s required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading them into Cube.

In PostgreSQL, we grant permissions to the user account used by Presto to access the database:

GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;

If we haven’t modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn’t contain the target S3 location yet, so all requests will fail.

We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.

If you want to use a different database name, follow the instructions in our documentation.

Caching configuration in Cube

Let’s open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:

preAggregations: {
  cacheResults: {
    type: `rollup`,
    measures: [
      errorpercentiles.perc_10, errorpercentiles.perc_10_value,
      errorpercentiles.perc_20, errorpercentiles.perc_20_value,
      errorpercentiles.perc_30, errorpercentiles.perc_30_value,
      errorpercentiles.perc_40, errorpercentiles.perc_40_value,
      errorpercentiles.perc_50, errorpercentiles.perc_50_value,
      errorpercentiles.perc_60, errorpercentiles.perc_60_value,
      errorpercentiles.perc_70, errorpercentiles.perc_70_value,
      errorpercentiles.perc_80, errorpercentiles.perc_80_value,
      errorpercentiles.perc_90, errorpercentiles.perc_90_value,
      errorpercentiles.perc_100, errorpercentiles.perc_100_value
    ],
    dimensions: [errorpercentiles.modelName, errorpercentiles.modelVariant],
    refreshKey: {
      every: `12 hour`,
    },
  },
},

After testing the development version, we can deploy the changes to production using the “Commit & Push” button. When we click it, we will be asked to type the commit message:

An empty “Commit Changes & Push” view

When we commit the changes, the deployment of a new version of the endpoint will start. A few minutes later, we can start sending queries to the endpoint.

We can also check the pre-aggregations window to verify whether Cube successfully created the cached data.

Successfully cached pre-aggregations

Now, we can move to the Playground tab and run our query. We should see the “Query was accelerated with pre-aggregation” message if Cube used the cached values to handle the request.

The message that indicates that our pre-aggregation works correctly

Building the front-end application

Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.

Our dashboard consists of two parts: the website and the back-end service. The web part contains only the code required to display the charts. The back-end service handles authentication and authorization, and it sends the requests to the Cube REST API.

Getting the Cube API key and the API URL

Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud’s web UI and click the “Env vars” tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to CUBEJS_API_SECRET and copy the value.

The Env vars tab on the settings page

We also need the URL of the Cube endpoint. To get this value, click the “Copy API URL” link in the top right corner of the screen.

The location of the Copy API URL link

Back end for front end

Now, we can write the back-end code.

First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect those model names in an array stored in the allowedModels variable.

After getting the user’s credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.

const jwt = require('jsonwebtoken');
const CUBE_URL = ''; // the API URL copied in the previous step
function create_cube_token() {
  const CUBE_API_SECRET = your_token; // Don’t store it in the code!!!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system

  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );
  
  return cubejsToken;
}

We will need two endpoints in our back-end service: the endpoint returning the chart data and the endpoint retrieving the names of models and variants we can access.

We create a new express application running in the node server and configure the /models endpoint:

const request = require('request');
const express = require('express')
const bodyParser = require('body-parser')
const port = 5000;
const app = express()

app.use(bodyParser.json())
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`)
})

In the getAvailableModels function, we query the Cube API to get the model names and variants. Because we have configured the Cube security context, it returns only the models we are allowed to see. Our function responds with a list of objects containing the modelName and modelVariant fields:

function getAvailableModels(req, res) {
  res.setHeader('Content-Type', 'application/json');
  request.post(CUBE_URL + '/load', {
    headers: {
      'Authorization': create_cube_token(),
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({"query": {
      "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
      ],
      "timeDimensions": [],
      "order": {
        "errorpercentiles.modelName": "asc"
      }
    }})
  }, (err, res_, body) => {
    if (err) {
      console.log(err);
    }
    const parsed = JSON.parse(body);
    const response = parsed.data.map(item => {
      return {
        modelName: item["errorpercentiles.modelName"],
        modelVariant: item["errorpercentiles.modelVariant"]
      }
    });
    res.send(JSON.stringify(response));
  });
};

Let’s retrieve the percentiles and percentile buckets. To simplify the example, we will show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.

The query specifies all measures we want to retrieve and sets the filter to get data belonging to a single model’s variant. We could retrieve all data at once, but we do it one by one for every variant.

{
  "query": {
    "measures": [
      "errorpercentiles.perc_10",
      "errorpercentiles.perc_20",
      "errorpercentiles.perc_30",
      "errorpercentiles.perc_40",
      "errorpercentiles.perc_50",
      "errorpercentiles.perc_60",
      "errorpercentiles.perc_70",
      "errorpercentiles.perc_80",
      "errorpercentiles.perc_90",
      "errorpercentiles.perc_100",
      "errorpercentiles.perc_10_value",
      "errorpercentiles.perc_20_value",
      "errorpercentiles.perc_30_value",
      "errorpercentiles.perc_40_value",
      "errorpercentiles.perc_50_value",
      "errorpercentiles.perc_60_value",
      "errorpercentiles.perc_70_value",
      "errorpercentiles.perc_80_value",
      "errorpercentiles.perc_90_value",
      "errorpercentiles.perc_100_value"
    ],
    "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
    ],
    "filters": [
      {
        "member": "errorpercentiles.modelName",
        "operator": "equals",
        "values": [
          req.query.model
        ]
      },
      {
        "member": "errorpercentiles.modelVariant",
        "operator": "equals",
        "values": [
          req.query.variant
        ]
      }
    ]
  }
}

The response parsing code extracts the number of values in every bucket and prepares bucket labels:

response = body.data.map(item => {
  return {
    modelName: item["errorpercentiles.modelName"],
    modelVariant: item["errorpercentiles.modelVariant"],
    labels: [
      "<=" + item['errorpercentiles.perc_10_value'],
      item['errorpercentiles.perc_20_value'],
      item['errorpercentiles.perc_30_value'],
      item['errorpercentiles.perc_40_value'],
      item['errorpercentiles.perc_50_value'],
      item['errorpercentiles.perc_60_value'],
      item['errorpercentiles.perc_70_value'],
      item['errorpercentiles.perc_80_value'],
      item['errorpercentiles.perc_90_value'],
      ">=" + item['errorpercentiles.perc_100_value']
    ],
    values: [
      item['errorpercentiles.perc_10'],
      item['errorpercentiles.perc_20'],
      item['errorpercentiles.perc_30'],
      item['errorpercentiles.perc_40'],
      item['errorpercentiles.perc_50'],
      item['errorpercentiles.perc_60'],
      item['errorpercentiles.perc_70'],
      item['errorpercentiles.perc_80'],
      item['errorpercentiles.perc_90'],
      item['errorpercentiles.perc_100']
    ]
  }
})
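Extracted as a pure function and run against a mocked Cube response, the parsing logic above can be sanity-checked without a live deployment. The mock values below are invented for the test; only the errorpercentiles field names come from the queries in this article.

```javascript
// Compact, testable version of the /percentiles response parsing.
function parsePercentiles(body) {
  const count = (item, n) => item[`errorpercentiles.perc_${n}`];
  const bound = (item, n) => item[`errorpercentiles.perc_${n}_value`];
  const buckets = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
  return body.data.map(item => ({
    modelName: item['errorpercentiles.modelName'],
    modelVariant: item['errorpercentiles.modelVariant'],
    // First and last labels get <= / >= prefixes, as in the handler above.
    labels: buckets.map(n =>
      n === 10 ? '<=' + bound(item, n)
        : n === 100 ? '>=' + bound(item, n)
        : bound(item, n)),
    values: buckets.map(n => count(item, n)),
  }));
}

// Mocked single-row Cube response for one model variant.
const mock = { data: [Object.fromEntries([
  ['errorpercentiles.modelName', 'churn'],
  ['errorpercentiles.modelVariant', 'A'],
  ...[10, 20, 30, 40, 50, 60, 70, 80, 90, 100].flatMap(n => [
    [`errorpercentiles.perc_${n}`, n],            // bucket counts
    [`errorpercentiles.perc_${n}_value`, n / 10], // bucket boundaries
  ]),
])] };

console.log(parsePercentiles(mock)[0].labels[0]); // <=1
```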

Dashboard website

In the last step, we build the dashboard website using Vue.js.

If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.

We define the main Vue component encapsulating the entire website content. In the script section, we will download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for all of them.

We put the charts in the Suspense component to allow asynchronous loading.

To keep the example short, we will skip the CSS style part.

<script setup>
  import OwnerName from './components/OwnerName.vue'
  import ChartView from './components/ChartView.vue'
  import axios from 'axios'
  import { ref } from 'vue'
  const models = ref([]);
  axios.get(SERVER_URL + '/models').then(response => {
    models.value = response.data
  });
</script>

<template>
  <header>
    <div class="wrapper">
      <OwnerName name="Test Inc." />
    </div>
  </header>
  <main>
    <div v-for="model in models" v-bind:key="model.modelName">
      <Suspense>
        <ChartView v-bind:title="model.modelName" v-bind:variant="model.modelVariant" type="percentiles"/>
      </Suspense>
    </div>
  </main>
</template>

The OwnerName component displays our client’s name. We will skip its code as it’s irrelevant in our example.

In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:

import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);

We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain those properties:

const props = defineProps({
  title: String,
  variant: String,
  type: String
})

Next, we retrieve the chart data and labels from the back-end service. We will also prepare the variable containing the label text:

const response = await axios.get(SERVER_URL + '/' + props.type + '?model=' + props.title + '&variant=' + props.variant)
const data = response.data[0].values;
const labels = response.data[0].labels;
const label_text = "Number of prediction errors of a given value"

Finally, we prepare the chart configuration variables:

const chartData = ref({
  labels: labels,
  datasets: [
    {
      label: label_text,
      backgroundColor: '#f87979',
      data: data
    }
  ],
});

const chartOptions = {
  plugins: {
    title: {
      display: true,
      text: props.title + ' - ' + props.variant,
    },
  },
  legend: {
    display: false
  },
  tooltip: {
    enabled: false
  }
}

In the template section of the Vue component, we pass the configuration to the Bar instance:

<template>
  <Bar ref="chart" v-bind:chart-data="chartData" v-bind:chart-options="chartOptions" />
</template>

If we have done everything correctly, we should see a dashboard page with error distributions.

Charts displaying the error distribution for different model variants

Wrapping up

Thanks for following this tutorial.

We encourage you to spend some time reading the Cube and Ahana documentation.

Please don’t hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube’s Slack community, and subscribe to the Ahana newsletter.

Athena Limitations & AWS Athena Limits | Comparing AWS Athena

Welcome to our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service. We hope you find this series helpful.

AWS Athena is an interactive query service built on PrestoDB that developers use to query data stored in Amazon S3 using standard SQL. It has a serverless architecture, and Athena users pay per query (it’s priced at $5 per terabyte scanned). Common Amazon Athena limits are technical limitations that include query limits, concurrent-query limits, and partition limits. These limits hurt performance: queries run slowly and operational costs increase. Plus, AWS Athena is built on an old version of PrestoDB and only supports a subset of PrestoDB features.

An overview on AWS Athena limits

AWS Athena query limits can cause problems, and many data engineering teams have spent hours trying to diagnose them. Some limits are hard, while others are soft quotas that you can ask AWS to increase. One big limitation concerns concurrency: Athena users can submit only one query at a time and, by default, run no more than five queries simultaneously per account.

AWS Athena query limits

AWS Athena Data Definition Language (DDL, like CREATE TABLE statements) and Data Manipulation Language (DML, like DELETE and INSERT) have the following limits: 

1.    Athena DDL max query limit: 20 active DDL queries.

2.    Athena DDL query timeout limit: The Athena DDL query timeout is 600 minutes.

3.    Athena DML query limit: By default, Athena allows 25 DML queries (running and queued) in the US East Region and 20 DML queries in all other Regions.

4.    Athena DML query timeout limit: The Athena DML query timeout limit is 30 minutes. 

5.    Athena query string length limit: The Athena query string hard limit is 262,144 bytes. 

Learn More About Athena Query Limits

We have put together a deep dive into Athena query limits in Part 2 of this series.

AWS Athena partition limits

  1. Athena users can use AWS Glue, a data catalog and ETL service. Athena’s partition limit is 20,000 per table, and Glue’s limit is 1,000,000 partitions per table.
  2. A Create Table As (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table. To work around this limitation, you must manually chop up your data by running a series of INSERT INTO statements that insert up to 100 partitions each.
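As a sketch of that workaround, the helper below chunks partition values into groups of at most 100 and generates one INSERT INTO statement per group. The table and column names (events_dest, events_src, dt) are hypothetical.

```javascript
// Athena can create at most 100 partitions per CTAS / INSERT INTO query,
// so we batch partition values and emit one INSERT INTO per batch.
function insertStatements(partitionValues, batchSize = 100) {
  const stmts = [];
  for (let i = 0; i < partitionValues.length; i += batchSize) {
    const batch = partitionValues.slice(i, i + batchSize);
    stmts.push(
      'INSERT INTO events_dest SELECT * FROM events_src WHERE dt IN (' +
      batch.map(d => `'${d}'`).join(', ') + ')'
    );
  }
  return stmts;
}

// 250 partitions -> 3 statements (100 + 100 + 50).
const days = Array.from({ length: 250 }, (_, i) => `day_${i}`);
console.log(insertStatements(days).length); // 3
```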

Athena database limits

AWS Athena also has the following S3 bucket limitations: 

1.    The Amazon S3 bucket limit is 100* buckets per account by default – you can request an increase up to 1,000 S3 buckets per account.

2.    Athena restricts each account to 100* databases, and databases cannot include over 100* tables.

*Note, recently Athena has increased this to 10K databases per account and 200K tables per database.

AWS Athena open-source alternative

Deploying your own PrestoDB cluster

An AWS Athena alternative is deploying your own PrestoDB cluster. AWS Athena is built on an old version of PrestoDB – in fact, it’s about 60 releases behind the PrestoDB project. Newer features are likely to be missing from Athena (and in fact it only supports a subset of PrestoDB features to begin with).

Deploying and managing PrestoDB on your own means you won’t have AWS Athena limitations such as the concurrent-query limits, database limits, table limits, and partition limits. Plus, you’ll get the very latest version of Presto. PrestoDB is an open source project hosted by The Linux Foundation’s Presto Foundation. It has a transparent, open, and neutral community.

If deploying and managing PrestoDB on your own is not an option (time, resources, expertise, etc.), Ahana can help.

Ahana Cloud for Presto: A fully managed service

Ahana Cloud for Presto is a fully managed Presto cloud service without the limits of AWS Athena.

With Ahana, you can query and analyze AWS data lakes stored in Amazon S3, and many other data sources, using the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI. Anyone at any knowledge level can use it: there is zero configuration effort and there are no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Up next: AWS Athena Query Limits

Related Articles 

Athena vs Presto

Learn the differences between AWS Athena and Presto and understand the pros and cons.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Reporting and dashboarding diagram

The Open Data Lakehouse – a quick intro

Data warehouses have been considered a standard to perform analytics on structured data but cannot handle unstructured data such as text, images, audio, video and other formats. Additionally, machine learning and AI are becoming common in every aspect of business and they need access to vast amounts of data outside of data warehouses.

The cloud transformation has triggered the disaggregation of compute and storage, which brings cost benefits and enables the flexibility to store data coming from multiple sources. All this has led to a new data platform architecture called the Open Data Lakehouse. It solves the challenges of the traditional cloud data warehouse through its use of open source and open format technologies such as Presto and Hudi. In this blog you will learn more about the open data lake analytics stack using Presto, Hudi, and AWS S3.

What is an Open Data Lakehouse

The Open Data Lakehouse is based on the concept of bringing your warehouse workloads to the data lake. You can run analytics on technology and tools that do not require any vendor lock-in including licensing, data formats, interfaces and infrastructure.

Four key elements include:

Open source – The technologies in the stack we will be exploring for Open Data Lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community.

Open formats – The stack doesn’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet, and others.

Open interfaces – The interfaces are industry standard ANSI SQL compatible and standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, industry standard language clauses continue to be added in and expanded on. 

Open cloud – The stack is cloud agnostic, aligns natively with containers, and can be run on any cloud.

Why Open Data Lakehouse

Open data lakehouses allow you to consolidate structured and unstructured data in a central repository at lower cost, and they remove the complexity of running ETL, resulting in higher performance and reduced cost and time to run analytics.

  • Bringing compute to your data (decouple of compute and storage)
  • Flexibility at the governance/transaction layer
  • Flexibility and low cost to store structured and semi/unstructured data
  • Flexibility at every layer – pick and choose which technology works best for your workloads/use case

Open Data Lakehouse architecture

Now let’s dive into the stack itself and each of the layers. We’ll discuss what problems each layer solves for.

BI/Application tools – Data Visualization, Data Science tools

Plug in your BI/analytical application tool of choice. The Open Data Lake Analytics stack supports the use of JDBC/ODBC drivers, so you can connect Tableau, Looker, Preset, Jupyter notebooks, etc., based on your use case and workload.

Presto – SQL Query Engine for the Data Lake

Presto is a parallel distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores. 

Presto is built for high-performance interactive querying with in-memory execution.

Key characteristics include: 

  • High scalability from 1 to 1000s of workers
  • Flexibility to support a wide range of SQL use cases
  • Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
  • Federation of data sources particularly data lakes via Presto connectors
  • Seamless integration with existing SQL systems with ANSI SQL standard

Deploying Presto clusters

A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes, and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into the data storage tier, like S3.

Apache Hudi – Streaming Transactions in the Open Data Lake

One of the big drawbacks of traditional data warehouses is keeping the data updated. It requires building data marts/cubes, then doing constant ETL from the source to the destination mart, resulting in additional time, cost, and duplication of data. Similarly, data in the data lake needs to be updated and consistent without that operational overhead.

A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves for the following: 

  • Ingesting incremental data
  • Changing data capture, both insert and deletion
  • Incremental data processing
  • ACID transactions

Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals, is an open-source transaction layer with storage abstraction for analytics, developed by Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses the open file formats Parquet and Avro for data storage and internal table types known as Copy-On-Write and Merge-On-Read.

It has built-in integration with Presto, so you can query Hudi datasets stored in the open file formats.

Hudi Data Management

Hudi has a table format based on a directory structure: a table has partitions, which are folders containing the data files for that partition. It has indexing capabilities to support fast upserts. Hudi has two table types that define how data is indexed and laid out, which in turn determines how the underlying data is exposed to queries.

Hudi data management

(Image source: Apache Hudi)

  • Copy-On-Write (COW): Data is stored in the Parquet file format (columnar storage), and each new update creates a new version of the files during a write. Updating an existing set of rows results in a rewrite of the entire Parquet file containing those rows.
  • Merge-On-Read (MOR): Data is stored in a combination of Parquet file format (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which will produce new versions of the columnar files.

Based on the two table types Hudi provides three logical views for querying data from the Data Lake.

  • Read-optimized – Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
  • Incremental – Queries see new data written to the table after a commit/compaction. This helps to build incremental data pipelines and analytics on them.
  • Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline

AWS S3 – The Data Lake

The data lake is the central location for storing data from disparate sources such as structured, semi-structured and unstructured data and in open formats on object storage such as AWS S3.

Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.

Getting Started:

How to run Open data lake analytics workloads using Presto to query Apache Hudi datasets on S3

Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can actually use Presto to query your Hudi datasets on S3.

Ingest your data into AWS S3 and query with Presto

Data can be ingested into the data lake from different sources, such as Kafka and other databases. By introducing Hudi into the data pipeline, the needed Hudi tables are created or updated, and the data is stored in either Parquet or Avro format in the S3 data lake, depending on the table type. Later, BI tools and applications can query the data using Presto, and the results reflect updates as the data changes.

Conclusion:

The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance and cost.

The technologies that make up that stack are critical. Presto, the de facto SQL query engine for the data lake, along with the transactional support and change-data-capture capabilities of Hudi, makes for a strong open source and open format solution for data lake analytics. A missing component, however, is data lake governance, which allows queries on S3 to run more securely. AWS has recently introduced Lake Formation, a data governance solution for data lakes, and Ahana, a managed service for Presto, seamlessly integrates Presto with AWS Lake Formation to run interactive queries on your AWS S3 data lakes with fine-grained access to data.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time: days instead of months, as is common with a traditional data lake approach.

What is an Open Data Lake in the Cloud?

Data Driven Insights diagram

Problems that necessitate a data lake

In today’s competitive landscape, companies are increasingly leveraging their data to make better decisions, provide value to their customers, and improve their operations. Data-driven insights can help business and product leaders home in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value. Traditionally, insights are gleaned from rather small amounts of enterprise data, which is what you’d expect: historical information about products, customers, and sales. But now, the modern business must deal with thousands of times more data, which encompasses more types of data and goes far beyond enterprise data. Examples include 3rd-party data feeds, IoT sensor data, event data, geospatial data, and other telemetry data.

The problem with having 1000s of times the data is that databases, and specifically data warehouses, can be very expensive. And data warehouses are optimized to handle relational data with a well-defined structure and schema. As both data volumes and usage grow, the costs of a data warehouse can easily spiral out of control. Those costs, coupled with the inherent lock-in associated with data warehouses, have left many companies looking for a better solution, either augmenting their enterprise data warehouse or moving away from them altogether. 

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture. 

The Open Data Lake in the cloud centers on S3-based object storage. In AWS, there can be many S3 buckets across an organization. In Google Cloud, the equivalent service is called Google Cloud Storage (GCS), and in Microsoft Azure it is called Azure Blob Storage. The data lake can store the relational data that typically comes from business apps, just like the data warehouse stores. But the data lake also stores non-relational data from a variety of sources, as mentioned above. The data lake can store structured, semi-structured, and/or unstructured data.

With all this data stored in the data lake, companies can run different types of analytics directly, such as SQL queries, real-time analytics, and AI/Machine Learning. A metadata catalog of the data enables the analytics of the non-relational data. 

Why Open for Data Lakes

As mentioned, companies have the flexibility to run different types of analytics, using different analytics engines and frameworks. Storing the data in open formats is the best practice for companies looking to avoid the lock-in of the traditional cloud data warehouse. The most common formats of a modern data infrastructure, such as Apache Parquet and ORC, are open. They are designed for fast analytics and are independent of any platform. Once data is in an open format like Parquet, it is natural to run open source engines like Presto on it. Ahana Cloud is a managed service for Presto that makes it easy, secure, and cost-efficient to run SQL on the Open Data Lake.

If you want to learn more about why you should be thinking about building an Open Data Lake in the cloud, check out our free whitepaper on Unlocking the Business Value of the Data Lake – how open and flexible cloud services help provide value from data lakes.

Helpful Links

Best Practices for Resource Management in PrestoDB

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

Best Practices for Resource Management in PrestoDB

Resource management in databases allows administrators to have control over resources and assign a priority to sessions, ensuring the most important transactions get the major share of system resources. Resource management in a distributed environment makes accessibility of data easier and manages resources over the network of autonomous computers (i.e. Distributed System). The basis of resource management in the distributed system is also resource sharing.

PrestoDB is a distributed query engine written by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Written for the Hadoop ecosystem, PrestoDB is built to scale to tens of thousands of nodes and process petabytes of data. In order to be usable at a production scale, PrestoDB was built to serve thousands of queries to multiple users without facing bottle-necking and “noisy neighbor” issues. PrestoDB makes use of resource groups in order to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups as well as best practices and considerations to think about before setting up a production system with resource grouping.

Getting Started

Presto has multiple “resources” that it can manage resource quotas for. The two main resources are CPU and memory. Additionally, there are granular resource constraints that can be specified, such as concurrency, time, and cpuTime. All of this is done via a pretty ugly JSON configuration file, shown in the example below from the PrestoDB doc pages.

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "schedulingPolicy": "weighted",
      "jmxExport": true,
      "subGroups": [
        {
          "name": "data_definition",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            },
            {
              "name": "bi-${tool_name}",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 10,
              "maxQueued": 100,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 3,
                  "maxQueued": 10
                }
              ]
            }
          ]
        },
        {
          "name": "pipeline",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 45,
          "maxQueued": 100,
          "schedulingWeight": 1,
          "jmxExport": true,
          "subGroups": [
            {
              "name": "pipeline_${USER}",
              "softMemoryLimit": "50%",
              "hardConcurrencyLimit": 5,
              "maxQueued": 100
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "bob",
      "group": "admin"
    },
    {
      "source": ".*pipeline.*",
      "queryType": "DATA_DEFINITION",
      "group": "global.data_definition"
    },
    {
      "source": ".*pipeline.*",
      "group": "global.pipeline.pipeline_${USER}"
    },
    {
      "source": "jdbc#(?<tool_name>.*)",
      "clientTags": ["hipri"],
      "group": "global.adhoc.bi-${tool_name}.${USER}"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ],
  "cpuQuotaPeriod": "1h"
}

Okay, so there is clearly a LOT going on here, so let’s start with the basics and work our way up. The first place to start is understanding the mechanisms Presto uses to enforce query resource limits.

Penalties

Presto doesn’t enforce resource limits at execution time. Rather, Presto introduces the concept of a ‘penalty’ for users who exceed their resource specification. For example, if user ‘bob’ were to kick off a huge query that ended up taking vastly more CPU time than allotted, then ‘bob’ would incur a penalty, which translates to an amount of time that ‘bob’s’ queries would be forced to wait in a queued state until they could run again. To see this scenario in action, let’s split the cluster resources in half and see what happens when two users each attempt to submit 5 different queries at the same time.

Resource Group Specifications

The example below is a resource specification of how to evenly distribute CPU resources between two different users.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1h"
}

The above resource config defines two main resource groups called ‘query1’ and ‘query2’. These groups will serve as buckets for the different queries/users. A few parameters are at work here:

  • hardConcurrencyLimit sets the number of concurrent queries that can be run within the group
  • maxQueued sets the limit on how many queries can be queued
  • schedulingPolicy ‘fair’ determines how queries within the same group are prioritized

Kicking off a single query as each user runs immediately, but subsequent queries stay QUEUED until the first completes, which confirms the hardConcurrencyLimit setting. Testing with 6 queued queries shows that maxQueued works as intended as well.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

Introducing the soft CPU limit penalizes any query caught using too much CPU time within a given CPU quota period. Here the period is set to 1 minute, and each group is given half of that CPU time. However, testing the above configuration yielded some odd results: once the first query finished, subsequent queries were queued for an inordinately long time. The Presto source code shows why. The softCpuLimit and hardCpuLimit are based on the combination of total cores and the cpuQuotaPeriod. For example, on a 10-node cluster of r5.2xlarge instances, each Presto worker node has 8 vCPUs. That gives 80 vCPUs across the workers, which translates to 80 vCPU-minutes per 1-minute cpuQuotaPeriod. Therefore, the correct values are shown below.
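To sanity-check these numbers, the CPU budget per quota period is just cluster vCPUs times the period length. A back-of-the-envelope sketch in plain arithmetic (not a Presto API):

```python
def cpu_budget_minutes(worker_nodes: int, vcpus_per_node: int,
                       quota_period_minutes: float) -> float:
    """Total vCPU-minutes of CPU time available per cpuQuotaPeriod."""
    return worker_nodes * vcpus_per_node * quota_period_minutes

# 10 r5.2xlarge workers x 8 vCPU x 1-minute cpuQuotaPeriod = 80 vCPU-minutes;
# splitting evenly between the two groups gives softCpuLimit = 40m each.
total = cpu_budget_minutes(10, 8, 1)
print(total, total / 2)  # 80 40.0
```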

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

In testing, the above resource group spec results in two queries completing, using a total of 127 minutes of CPU time. From there, all further queries block for about 2 minutes before they run again. The blocked time works out because for every minute of cpuQuotaPeriod, each group is granted 40 minutes back against its penalty. Since the first minute’s queries exceeded the limit by 80+ minutes, it takes 2 cpuQuotaPeriods to bring the penalty back down to zero so queries can be submitted again.
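The penalty arithmetic above can be modeled directly. This is a back-of-the-envelope sketch of the described behavior (each cpuQuotaPeriod refunds softCpuLimit minutes against the penalty), not Presto’s actual accounting code:

```python
def remaining_penalty(cpu_minutes_used: float, soft_limit_minutes: float,
                      periods_elapsed: int) -> float:
    """CPU-time penalty (in minutes) left after some number of cpuQuotaPeriods."""
    penalty = max(0.0, cpu_minutes_used - soft_limit_minutes)
    refunded = soft_limit_minutes * periods_elapsed
    return max(0.0, penalty - refunded)

# 127m of CPU time against a 40m softCpuLimit leaves an ~87m penalty;
# refunding 40m per 1-minute cpuQuotaPeriod drains it after about 2 periods.
print(remaining_penalty(127.0, 40.0, 0))  # 87.0
print(remaining_penalty(127.0, 40.0, 2))  # 7.0
```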

Conclusion

Resource group implementation in Presto definitely has room for improvement. Most obviously, ad hoc users who don’t understand the cost of a query before executing it will be heavily penalized until they submit only very low-cost queries. Still, this mechanism limits the damage a single user can do to a cluster over an extended duration, and usage averages out in the long run. Overall, resource groups are better suited to scheduled workloads that depend on variable input data, where they prevent a scheduled job from arbitrarily taking over a large chunk of resources. For resource partitioning between multiple users or teams, the best approach still seems to be running and maintaining multiple segregated Presto clusters.


Ready to get started with Presto? Check out our tutorial series where we cover the basics: Presto 101: Installing & Configuring Presto locally.

Configuring RaptorX – multi-level caching with Presto


RaptorX Background and Context

Meta introduced a multi-level cache at PrestoCon 2021. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. Here at Ahana, engineers have also been working on RaptorX to make it usable for the community: fixing open issues, tuning, and testing heavily with other workloads. This is a unique and very powerful feature available only in PrestoDB, not in other versions or forks of the Presto project.

Presto is a disaggregated compute-storage query engine, which helps customers and cloud providers scale independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, since scanning huge amounts of data between the storage tier and the compute tier is IO-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to perform no I/O at all, which makes memory utilization and caching structures of utmost importance.

Let’s walk through the normal workflow of the Presto Hive connector:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition info).
  2. The scheduler sends requests to remote storage to get the list of files and schedules them.
  3. Each worker node receives its list of files from the scheduler and sends a request to remote storage to open the files and read the file footers.
  4. Based on the footer, Presto determines which data blocks or chunks to read from remote storage.
  5. Once the workers have read them, Presto performs computation (joins or aggregations) on the leaf worker nodes and shuffles results back to send to the client.

That is a lot of RPC calls: to the Hive Metastore for partition information, and to remote storage to list files, schedule them, open them, and finally retrieve and read the data. Each of these IO paths in the Hive connector is a bottleneck on query performance, which is why RaptorX builds a multi-layer cache, so you can maximize the cache hit rate and boost query performance.

RaptorX introduces a total of five types of caches plus a scheduler. This cache system is only applicable to Hive connectors.

| Multi-layer Cache | Type | Affinity Scheduling | Benefits |
| --- | --- | --- | --- |
| Data IO | Local disk | Required | Reduced query latency |
| Intermediate Result Set | Local disk | Required | Reduced query latency and CPU utilization for aggregation queries |
| File Metadata | In-memory | Required | Reduced CPU utilization and query latency |
| Metastore | In-memory | NA | Reduced query latency |
| File List | In-memory | NA | Reduced query latency |

Table: Summary of the Presto multi-layer cache implementation

The rest of this article explains how to configure and test the various layers of RaptorX cache in your Presto cluster.

#1 Data(IO) cache

This cache uses a library built on Alluxio’s LocalCacheFileSystem, an implementation of the HDFS interface. The Alluxio data cache is a local-disk cache on each worker node that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB. It uses an LRU policy for evictions, and enabling it requires local disks on the workers.

To enable this cache, update the worker configuration with the properties below in etc/catalog/<catalog-name>.properties:

cache.enabled=true
cache.type=ALLUXIO
# Size this based on your requirements
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache

Also add the following Alluxio property to the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:

-Dalluxio.user.app.id=presto

#2 Fragment result set cache

This is an intermediate result set cache: it caches partially computed result sets on the worker’s local SSD. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.

Add the following properties under etc/config.properties:

fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h

#3 Metastore cache

A Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to metastore. This cache is versioned to confirm validity of cached metadata.

To enable the metastore cache, set the properties below in etc/catalog/<catalog-name>.properties:

hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

#4 File List cache

A Presto coordinator caches file lists from the remote storage partition directory to avoid long listFile calls to remote storage. This is a coordinator-only in-memory cache.

Enable the file list cache by setting the properties below in etc/catalog/<catalog-name>.properties:

# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*

#5 File metadata cache

This cache keeps open file descriptors and stripe/file footer information in worker memory. These pieces of data are the most frequently accessed when reading files. The cache is not only useful for decreasing query latency but also for reducing CPU utilization.

This is an in-memory cache, suitable for the ORC and Parquet file formats.

For ORC, it includes the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches file- and block-level metadata.

To enable the metadata cache, set the properties below in etc/catalog/<catalog-name>.properties:

# For ORC metadata cache:
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h

Replace <catalog-name> in the above configuration with the name of the catalog you are setting these properties in. For example, if the catalog properties file is named ahana_hive.properties, replace it with “ahana_hive”.
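As a convenience, that substitution can be scripted. A small hypothetical helper (the property keys are a subset of the ones listed above):

```python
def render_cache_properties(catalog: str) -> str:
    """Prefix each cache property key with the catalog name, as Presto expects."""
    properties = {
        "orc.file-tail-cache-enabled": "true",
        "orc.file-tail-cache-size": "100MB",
        "parquet.metadata-cache-enabled": "true",
        "parquet.metadata-cache-size": "100MB",
    }
    return "\n".join(f"{catalog}.{key}={value}"
                     for key, value in properties.items())

# For a catalog file named ahana_hive.properties:
print(render_cache_properties("ahana_hive"))
```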

#6 Affinity scheduler

With affinity scheduling, the Presto coordinator schedules requests that process a given file to the same Presto worker node to maximize cache hits. Consistently sending requests for the same data to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling, so this property is a must-have to make the RaptorX Data IO, fragment result, and file metadata caches work.
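The idea behind soft affinity can be sketched in a few lines: hash each file path to a preferred worker so repeated reads land on the same node’s cache. This is an illustration of the concept, not Presto’s actual SOFT_AFFINITY implementation:

```python
import hashlib

def preferred_worker(file_path: str, workers: list) -> str:
    """Deterministically map a file to a worker so its cached pages are reused."""
    digest = hashlib.md5(file_path.encode("utf-8")).hexdigest()
    return workers[int(digest, 16) % len(workers)]

workers = ["worker-1", "worker-2", "worker-3"]
# The same file always maps to the same worker, maximizing cache hits.
w1 = preferred_worker("s3://bucket/orders/part-0001.orc", workers)
w2 = preferred_worker("s3://bucket/orders/part-0001.orc", workers)
assert w1 == w2
```

(In practice the scheduling is “soft”: if the preferred node is overloaded, work falls back to another node rather than queueing indefinitely.)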

To enable the affinity scheduler, set the property below in etc/catalog/<catalog-name>.properties:

hive.node-selection-strategy=SOFT_AFFINITY

How can you test or debug your RaptorX cache setup with JMX metrics?

Each section below lists queries to run, followed by JMX queries that verify cache usage.

Note: if your catalog is not named ‘ahana_hive’, substitute your catalog name into the table names below to verify the cache usage.

Data IO Cache

Queries to trigger Data IO cache usage

USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;

Queries to verify Data IO data cache usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages(of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';

Fragment Result Cache

An example of the query plan fragment that is eligible for having its results cached is shown below.

Fragment 1 [SOURCE]
    Output layout: [count_3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    - Aggregate(PARTIAL) => [count_3:bigint]
            count_3 := "presto.default.count"(*)
        - TableScan[TableHandle {connectorId='hive',
              connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc,
              analyzePartitionValues=Optional.empty}',
              layout='Optional[default.customer_orc{}]'}, gr
          Estimates: {rows: 150000 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
          LAYOUT: default.customer_orc{}

Queries to trigger fragment result cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query Fragment Set Result cache JMX metrics.

-- All Fragment result set cache metrics like cachehit, cache entries, size, etc 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";

ORC metadata cache

Queries to trigger ORC cache usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;

Query ORC Metadata cache JMX metrics

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";

Parquet metadata cache

Queries to trigger Parquet metadata cache

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;

Query Parquet Metadata cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";

File List cache

Query File List cache JMX metrics.

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";

In addition, for Ahana Cloud we have enabled these multi-layer caches on Presto by adding S3 support as the external filesystem for the Data IO cache, more optimized scheduling, and tooling to visualize cache usage.

Figure: Multi-level Data Lake Caching with RaptorX

Ahana-managed Presto clusters can take advantage of the RaptorX cache, and at Ahana we have simplified all these steps so that data platform users can enable Data Lake caching seamlessly with just one click. Ahana Cloud for Presto gets you up and running with an Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexity of tuning, management, and more. Check out our on-demand webinar where we share how you can build an Open Data Lake Analytics stack.

Benchmarking Warehouse Workloads on the Data Lake using Presto

TPC-H Benchmark Whitepaper

How to run a TPC-H Benchmark on Presto

Presto is an open source MPP query engine designed from the ground up for high performance with linear scaling. Businesses looking to solve their analytics workloads with Presto need to understand how to evaluate its performance, and this document will help you do exactly that.

To help users who would like to benchmark Presto, we’ve written a technical guide on setting up a Presto benchmark with Benchto, an open source framework that provides an easy, manageable way to define, run, and analyze macro benchmarks in a clustered environment.

Running a benchmark on Presto can help you to identify things like: 

  • system resource requirements 
  • resource usage during various operations 
  • performance metrics for such operations
  • …and more, depending on your workload and use case

This technical guide provides an overview on TPC-H, the industry standard for benchmarking, and explains how to configure and use the open-source Benchto tool to benchmark Presto. It also shows an example on comparing results between two different runs of an Ahana-managed Presto cluster with and without cache enabled.

We hope you find this useful! Happy benchmarking.

Announcing the Ahana Cloud for Presto integration with AWS Lake Formation

We’re excited to announce that Ahana Cloud for Presto now integrates with AWS Lake Formation, including support for the recent general availability of row-level security.

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Customers can manage permissions to data in a single place, making it easier to enforce security across a wide range of tools and services. Over the past several months we’ve worked closely with the AWS Lake Formation team to bring Lake Formation capabilities to Presto on AWS.  Further, we’re grateful to our customers who were willing to preview early versions of our integration.

Today, Ahana Cloud for Presto allows customers to use Presto to query their data protected with AWS Lake Formation fine-grained permissions with a few clicks.  Our customers can bring Presto to their existing AWS stack and scale their data teams without compromising security.  We’re thrilled that the easiest managed service for Presto on AWS just got easier and more secure.

Here’s a quick video tutorial that shows you how easy it is to get started with AWS Lake Formation and Ahana:

Additionally, we’ve put together a list of resources where you can learn more about the integration.

What’s Next?

If you’re ready to get started with AWS Lake Formation and Ahana Cloud, head over to our account sign up page where you can start with a free 14-day trial of Ahana Cloud. You can also drop us a note at product@ahana.io and we can help get you started. Happy building!

Presto on Spark

Overview

Presto was originally designed to run interactive queries against data warehouses, but now it has evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. Popular workloads on data lakes include:

1. Reporting and dashboarding

This includes serving custom reporting to both internal and external developers for business insights; many organizations also use Presto for interactive A/B testing analytics. A defining characteristic of this use case is a requirement for low latency, tens to hundreds of milliseconds at very high QPS. Not surprisingly, this use case almost exclusively uses Presto; it’s what Presto was designed for.

2. Data science with SQL notebooks

This use case is ad hoc analysis, typically with moderate latency requirements ranging from seconds to minutes. These are the queries of data scientists and business analysts who want to perform compact ad hoc analysis to understand product usage, for example user trends and how to improve the product. QPS is relatively lower because users initiate these queries manually.

3. Batch processing for large data pipelines

These are scheduled jobs that run every day, hour, or whenever the data is ready. They often contain queries over very large volumes of data; latency can be up to tens of hours, and processing can range from CPU-days to CPU-years over terabytes to petabytes of data.

Presto works exceptionally well for ad hoc or interactive queries today, and even some batch queries, with the constraint that the entire query must fit in memory and run quickly enough that fault tolerance is not required. Most ETL batch workloads that don’t fit in this box run on “very big data” compute engines like Apache Spark. Having multiple compute engines with different SQL dialects and APIs makes managing and scaling these workloads complicated for data platform teams. Hence, Facebook decided to simplify things and build Presto on Spark as the path to further scale Presto. Before we get into Presto on Spark, let me explain a bit more about the architecture of each of these two popular engines.

Presto’s Architecture


Presto is designed for low latency and follows the classic MPP architecture; it uses in-memory streaming shuffle to achieve low latency. Presto has a single shared coordinator per cluster with an associated pool of workers. Presto tries to schedule as many queries as possible on the same Presto worker (shared executor), in order to support multi-tenancy.

This architecture provides very low latency scheduling of tasks and allows concurrent processing of multiple stages of a query, but the tradeoff is that the coordinator is a SPOF and bottleneck, and queries are poorly isolated across the entire cluster.

Additionally streaming shuffle does not allow for much fault tolerance further impacting the reliability of long running queries.

Spark’s Architecture


On the other hand, Apache Spark was designed for scalability from the very beginning, and it implements a MapReduce architecture. Shuffle is fully materialized to disk between stages of execution, with the capability to preempt or restart any task. Spark maintains an isolated Driver to coordinate each query and runs tasks in isolated containers scheduled on demand. These differences improve reliability and reduce overall operational overhead.

Why isn’t Presto alone a good fit for batch workloads?

Scaling an MPP-architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1]. To simplify, let’s examine the aggregation query below. Essentially, it scans the orders table in TPC-H, groups on custkey, and sums the total price. Presto leverages in-memory shuffle: after reading the data and pre-aggregating per key on each worker, it shuffles on custkey.
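To make the shape of that query concrete, here is a toy version of the aggregation using Python’s sqlite3 (the rows are made up for illustration; in Presto this runs against the real TPC-H orders table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (custkey INTEGER, totalprice REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# Group by custkey and sum totalprice -- the aggregation Presto shuffles on.
rows = conn.execute(
    "SELECT custkey, SUM(totalprice) FROM orders "
    "GROUP BY custkey ORDER BY custkey"
).fetchall()
print(rows)  # [(1, 150.0), (2, 75.0)]
```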


With an in-memory shuffle, the producer buffers data in memory and waits for it to be fetched by the consumer. As a result, all tasks before and after the exchange must execute at the same time; in MapReduce terms, all the mappers and all the reducers have to run concurrently. This makes in-memory shuffle an all-or-nothing execution model.

This makes scheduling inflexible, and scaling query size becomes more difficult because everything runs concurrently. In the aggregation phase, a query may exceed the memory limit because everything must be held in memory in hash tables to track each group (custkey).

Additionally, we are limited by the size of the cluster in how many nodes we can hash-partition the data across to avoid having to fit it all in memory. Using distributed disk (Presto on Spark, Presto Unlimited) we can partition the data further; we are then limited only by the number of open files, and even that limit can be scaled quite a bit by a shuffle service.

This makes it difficult to scale Presto to very large, complex batch pipelines, which can run for hours joining and aggregating over huge amounts of data. That motivated the development of Presto Unlimited, which adapts Presto’s MPP design to large ETL workloads and improves the user experience at scale.


While Presto Unlimited solved part of the problem by allowing shuffle to be partitioned over distributed disk, it didn’t fully solve fault tolerance, and did nothing to improve isolation and resource management.

Presto on Spark

Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler/evaluation as a library with Spark’s RDD API used to manage execution of Presto’s embedded evaluation. This is similar to how Google chose to embed F1 Query inside their MapReduce framework.

The high-level goal is to bring a fully disaggregated shuffle to Presto’s MPP runtime, which we achieved by adding a materialization step right after the shuffle. The materialized shuffle is modeled as a temporary partition table, which brings more flexible execution after the shuffle and allows partition-level retries. With Presto on Spark, we can do a fully disaggregated shuffle on custkey for the above query on both the mapper and reducer side, meaning all mappers and reducers can be independently scheduled and independently retried.
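A toy model of the materialized shuffle may help. Mappers write their output per partition (standing in for the temporary partition table); each reducer then reads its partition independently, so a failed partition can be retried on its own. This is an illustration of the idea, not the actual Presto-on-Spark code:

```python
from collections import defaultdict

def map_stage(rows, num_partitions):
    """Materialize mapper output per shuffle partition (the 'temp table')."""
    store = defaultdict(list)
    for key, value in rows:
        store[hash(key) % num_partitions].append((key, value))
    return store

def reduce_partition(store, partition):
    """Aggregate one partition; it can be re-run in isolation on failure."""
    totals = defaultdict(float)
    for key, value in store[partition]:
        totals[key] += value
    return dict(totals)

store = map_stage([("a", 1.0), ("b", 2.0), ("a", 3.0)], num_partitions=2)
results = {}
for p in range(2):
    results.update(reduce_partition(store, p))
print(sorted(results.items()))  # [('a', 4.0), ('b', 2.0)]
```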


Presto On Spark at Intuit

Superglue is a homegrown tool at Intuit that helps users build, manage and monitor data pipelines. Superglue was built to democratize data for analysts and data scientists. Superglue minimizes time spent developing and debugging data pipelines, and maximizes time spent on building business insights and AI/ML.

Many analysts at Intuit use Presto (AWS Athena) to explore data in the data lake/S3. These analysts would spend several hours converting exploration SQL written for Presto to Spark SQL in order to operationalize and schedule it as data pipelines in Superglue. To minimize SQL dialect conversion issues and the associated productivity loss, the Intuit team explored various options including query translation, query virtualization, and Presto on Spark. After a quick POC, Intuit decided to go with Presto on Spark, since it leverages Presto’s compiler/evaluation as a library (no query conversion required) together with Spark’s scalable data processing capabilities.

Presto on Spark is now in production at Intuit. Within three months, hundreds of critical pipelines comprising thousands of jobs were running on Presto on Spark via Superglue.

Presto on Spark runs as a library submitted with spark-submit or as a Jar Task on the Spark cluster. Scheduled batch data pipelines are launched on ephemeral clusters to take advantage of resource isolation, manage cost, and minimize operational overhead. DDL statements execute against Hive and DML statements execute against Presto, so analysts can write Hive-compatible DDL and the user experience remains unchanged.

This solution helped enable a performant and scalable platform with seamless end-to-end experience for analysts to explore and process data. It thereby improved analysts’ productivity and empowered them to deliver insights at high speed.

When To Use Spark’s Execution Engine With Presto

Spark is the tool of choice across the industry for running large-scale, complex batch ETL pipelines. Presto on Spark heavily benefits pipelines written in Presto that operate on terabytes or petabytes of data, since it takes advantage of Spark’s large-scale processing capabilities. The biggest win is that no query conversion is required, and you can leverage Spark for:

  • Scaling to larger data volumes
  • Scaling Presto’s resource management to larger clusters
  • Increasing the reliability and elasticity of Presto as a compute engine

Why ‘Presto on Spark’ matters

We tried to achieve the following to adapt ‘Presto on Spark’ to Internet-scale batch workloads [2]:

  • Fully disaggregated shuffles
  • Isolated executors
  • Presto resource management, Different Scheduler, Speculative Execution, etc.

A unified option for batch data processing and ad hoc analysis is very important for creating an experience where queries scale instead of failing, without requiring rewrites between different SQL dialects. We believe this is only a first step towards more confluence between the Spark and Presto communities, and a major step towards a unified SQL experience across interactive and batch use cases. Today, internet giants like Facebook have moved over to Presto on Spark, and many organizations, including Intuit, have started running their complex data pipelines in production with it.

“Presto on Spark” is one of the most active development areas in Presto; feel free to check it out and please give it a star! If you have any questions, ask in the PrestoDB Slack channel.

Reference

[1] MapReduce: Simplified Data Processing on Large Clusters 

[2] Presto-on-Spark: A Tale of Two Computation Engines

Tutorial: How to define SQL functions with Presto across all connectors

Presto is the open source SQL query engine for data lakes. It supports many native functions, which are usually sufficient for most use cases. However, there may be a corner case where you need to implement your own function. To simplify this, Presto allows users to define expressions as SQL functions. These are dynamic functions, separate from the Presto source code, managed by a function namespace manager that you can set up with a MySQL database. In fact, this is one of the most widely used features of Presto at Facebook, with thousands of functions defined.

Function Namespace Manager

A function namespace is a special catalog.schema that stores functions, for example mysql.test. Each catalog.schema can be a function namespace. A function namespace manager is a plugin that manages a set of these function catalog schemas. A catalog can be mapped to a connector in Presto (a connector for functions only, with no tables or views), which allows the Presto engine to perform actions such as creating, altering, and deleting functions.

This user-defined function management is separated from the connector API for flexibility, so these SQL functions can be used across all connectors. Further, a query is guaranteed to use the same version of a function throughout its execution, and any modification to functions is versioned.

Implementation

Today, the function namespace manager is implemented on top of MySQL, so you need a running MySQL service to initialize it.

Step 1: Provision a MySQL server and note its JDBC URL.

Supposing the MySQL server is reachable at localhost:3306, an example database URL is:

jdbc:mysql://localhost:3306/presto?user=root&password=password

Step 2: Create a database in MySQL to store the function namespace manager’s data

 CREATE DATABASE presto;
 USE presto;

Step 3: Configure Presto [2]

Create the function namespace manager configuration under etc/function-namespace/mysql.properties:

function-namespace-manager.name=mysql
database-url=jdbc:mysql://localhost:3306/presto?user=root&password=password
function-namespaces-table-name=function_namespaces
functions-table-name=sql_functions

Then restart the Presto service.

Step 4: Create a new function namespace

Once the Presto server has restarted, you will see the tables below under the presto database (which is used to manage function namespaces) in MySQL:

mysql> show tables;
+---------------------+
| Tables_in_presto    |
+---------------------+
| enum_types          |
| function_namespaces |
| sql_functions       |
+---------------------+
3 rows in set (0.00 sec)

To create a new function namespace "ahana.default", insert into the function_namespaces table:

INSERT INTO function_namespaces (catalog_name, schema_name)
    VALUES('ahana', 'default');

Step 5: Create a function and query from Presto [1]

Here is a simple example of a SQL function for cosecant:

presto> CREATE OR REPLACE FUNCTION ahana.default.cosec(x double)
RETURNS double
COMMENT 'Cosecant trigonometric function'
LANGUAGE SQL
DETERMINISTIC
RETURNS NULL ON NULL INPUT
RETURN 1 / sin(x);

More examples can be found at https://prestodb.io/docs/current/sql/create-function.html#examples [1]

Step 6: Apply the newly created function and SQL query

Users must use the fully qualified function name when calling the function in a SQL query.

Following is an example of using the cosec SQL function in a query:

presto> select ahana.default.cosec (50) as Cosec_value;
     Cosec_value     
---------------------
 -3.8113408578721053 
(1 row)

Query 20211103_211533_00002_ajuyv, FINISHED, 1 node
Splits: 33 total, 33 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
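The value returned above can be sanity-checked outside of Presto, since the function body is just 1 / sin(x) (with x in radians). A quick, illustrative check in Python:

```python
# Reproduce the SQL body RETURN 1 / sin(x) and compare against the
# query output above; Presto's sin() operates on radians.
import math

def cosec(x: float) -> float:
    return 1.0 / math.sin(x)

print(cosec(50))  # approximately -3.8113408578721053, as in the query output
```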

Here is another simple example: an EpochTimeToLocalDate function under the ahana.default function namespace that converts Unix time to a timestamp in the session's local timezone.

presto> CREATE FUNCTION ahana.default.EpochTimeToLocalDate(x bigint)
     -> RETURNS timestamp
     -> LANGUAGE SQL
     -> DETERMINISTIC RETURNS NULL ON NULL INPUT
     -> RETURN from_unixtime(x);
CREATE FUNCTION

presto> select ahana.default.EpochTimeToLocalDate(1629837828) as date;
          date           
-------------------------
 2021-08-24 13:43:48.000 
(1 row)
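Note that from_unixtime interprets the timestamp in the session timezone, which is why the output above shows 13:43:48 (the server's local time). The same instant rendered in UTC is 20:43:48, which can be checked with a small Python sketch:

```python
# The tutorial's query returned 2021-08-24 13:43:48 in the server's
# local timezone; the same Unix timestamp rendered in UTC is 20:43:48.
from datetime import datetime, timezone

ts = 1629837828
utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-08-24 20:43:48
```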

Note

The function-namespaces-table-name property (the name of the table that stores all the function namespaces managed by this manager) can be used if there is a use case for instantiating multiple function namespace managers. Otherwise, a single function namespace manager is sufficient, and the functions created in it can be used across all databases and connectors. [2]

At Ahana, we have simplified all of these steps: the MySQL container, schema, databases, tables, and the additional configuration required to manage functions. Data platform users just create their own SQL functions and use them in SQL queries; there is no need to provision and manage an additional MySQL server.

Future Roadmap

Remote function support with a remote UDF Thrift API

This will allow you to run arbitrary functions that are either not safe or not possible to run within the worker JVM: unreliable Java functions, C++, Python, etc.

References

[1] DDL Syntax to use FUNCTIONS

[2] Function Namespace Manager Documentation

Presto 105: Running Presto with AWS Glue as catalog on your Laptop

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto 104: Running Presto with Hive Metastore

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with AWS Glue as a catalog on a laptop.

We explained in Presto 104 why we use a catalog. To recap, Presto is a disaggregated database engine: it has the top part of the database stack (the SQL parser, compiler, optimizer, scheduler, and execution engine) but not other components of the database, such as the system catalog. In the data lake world, the system catalog where the database schema resides is simply called a catalog. Two popular catalogs have emerged: the Hive Metastore and the AWS Glue catalog.

What is AWS Glue?

AWS Glue is a serverless data integration service provided by AWS that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. The AWS Glue catalog does the mapping between the database tables and columns and the objects or files that reside in the data lake, such as immutable objects in AWS S3.

In this tutorial, we will focus on using Presto with the AWS Glue on your laptop.   

This document simplifies the process for a laptop scenario to get you started. For real production workloads, you can try out Ahana Cloud which is a managed service for Presto on AWS and comes pre-integrated with an AWS Glue catalog.

Implementation steps

Step 1: 

Create a docker network namespace so that the containers can communicate with each other over it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2: 

Ahana has developed a sandbox for PrestoDB that can be downloaded from Docker Hub. Use the command below to pull the PrestoDB sandbox, which comes with all the packages needed to run PrestoDB.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete                                                          

e8f8aa933633: Pull complete                                                          
b7cf38297b9f: Pull complete                                                          
a4205d42b3be: Pull complete                                                          
81b659bbad2f: Pull complete                                                          
3ef606708339: Pull complete
979857535547: Pull complete                                                          
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the PrestoDB sandbox and name it coordinator:

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to run in this container, without a worker node, so let's edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator with the worker service stopped.

# docker restart coordinator

Step 6:

Create three more containers using the ahanaio/prestodb-sandbox image:

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1
ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2
ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3
ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively, and point discovery.uri at the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080
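Editing three containers by hand is repetitive. As a rough sketch (the temporary output directory and file names are invented for illustration), the three worker files from Step 7 could be generated locally and then copied into each container, e.g. with docker cp:

```python
# Render etc/config.properties for worker1..worker3 following the port
# scheme above (8081-8083); discovery.uri points at the coordinator.
from pathlib import Path
import tempfile

TEMPLATE = (
    "coordinator=false\n"
    "http-server.http.port={port}\n"
    "discovery.uri=http://coordinator:8080\n"
)

out_dir = Path(tempfile.mkdtemp())
for i in range(1, 4):
    (out_dir / f"worker{i}.config.properties").write_text(TEMPLATE.format(port=8080 + i))

print((out_dir / "worker2.config.properties").read_text())
```

Each rendered file matches the config shown above, with only the port differing per worker.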

Step 8:

Now we will install the AWS CLI and configure AWS Glue on the coordinator and worker containers.

# yum install -y aws-cli

Step 9: 

Create a glue user and attach the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies:

aws iam create-user --user-name glueuser
{
    "User": {
        "Path": "/",
        "UserName": "glueuser",
        "UserId": "AXXXXXXXXXXXXXXXX",
        "Arn": "arn:aws:iam::XXXXXXXXXX:user/glueuser",
        "CreateDate": "2021-10-07T01:07:28+00:00"
    }
}

aws iam list-policies | grep AmazonS3FullAccess
            "PolicyName": "AmazonS3FullAccess",
            "Arn": "arn:aws:iam::aws:policy/AmazonS3FullAccess",

aws iam list-policies | grep AWSGlueConsoleFullAccess
            "PolicyName": "AWSGlueConsoleFullAccess",
            "Arn": "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AmazonS3FullAccess"

aws iam attach-user-policy --user-name glueuser --policy-arn "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"

Step 10:

Create an access key:

% aws iam create-access-key --user-name glueuser
{
    "AccessKey": {
        "UserName": "glueuser",
        "AccessKeyId": "XXXXXXXXXXXXXXXXXX",
        "Status": "Active",
        "SecretAccessKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "CreateDate": "2021-10-13T01:50:45+00:00"
    }
}

Step 11:

Run aws configure and enter the access key and secret key created above.

aws configure
AWS Access Key ID [None]: XXXXXXXXXXXXX
AWS Secret Access Key [None]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Default region name [None]:
Default output format [None]:

Step 12:

Create the /opt/presto-server/etc/catalog/glue.properties file to add the AWS Glue properties to Presto; this file needs to be added on both the coordinator and worker containers. Add the AWS access and secret keys generated in the previous step to hive.metastore.glue.aws-access-key and hive.metastore.glue.aws-secret-key.

connector.name=hive-hadoop2
hive.metastore=glue
hive.non-managed-table-writes-enabled=true
hive.metastore.glue.region=us-east-2
hive.metastore.glue.aws-access-key=<your AWS key>
hive.metastore.glue.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use glue as the catalog:

bash-4.2# presto-cli --server localhost:8080 --catalog glue

Step 15:

Create a schema using an S3 location.

presto:default> create schema glue.demo with (location= 's3://Your_Bucket_Name/demo');
CREATE SCHEMA
presto:default> use demo;

Step 16:

Create a table under the glue.demo schema:

presto:demo> create table glue.demo.part with (format='parquet') AS select * from tpch.tiny.part;
CREATE TABLE: 2000 rows
    
Query 20211013_034514_00009_6hkhg, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:06 [2K rows, 0B] [343 rows/s, 0B/s]

Step 17:

Run a select statement on the newly created table:

presto:demo> select * from glue.demo.part limit 10; 
partkey |                   name                   |      mfgr      |  brand
---------+------------------------------------------+----------------+---------
       1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13
       2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13
       3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42
       4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34
       5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32
       6 | bisque cornflower lawn forest magenta    | Manufacturer#2 | Brand#24
       7 | moccasin green thistle khaki floral      | Manufacturer#1 | Brand#11
       8 | misty lace thistle snow royal            | Manufacturer#4 | Brand#44
       9 | thistle dim navajo dark gainsboro        | Manufacturer#4 | Brand#43
      10 | linen pink saddle puff powder            | Manufacturer#5 | Brand#54

Summary

In this tutorial, we provide steps to use Presto with AWS Glue as a catalog on a laptop. If you’re looking to get started easily with Presto and a pre-configured Glue catalog, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.

Presto 104: Running Presto with Hive Metastore on your Laptop

Introduction

This is the 4th tutorial in our Getting Started with Presto series. To recap, here are the first 3 tutorials:

Presto 101: Installing & Configuring Presto locally

Presto 102: Running a three node PrestoDB cluster on a laptop

Presto 103: Running a Prestodb cluster on GCP

Presto is an open source distributed parallel query SQL engine that runs on a cluster of nodes. In this tutorial we will show you how to run Presto with Hive Metastore on a laptop.

Presto is a disaggregated engine. This means that Presto has the top part of the database stack – the SQL parser, compiler, optimizer, scheduler, execution engine – but it does not have other components of the database, including the system catalog. In the data lake world, the system catalog where the database schema resides lives in what is called a catalog. Two popular catalogs have emerged. From the Hadoop world, the Hive Metastore continues to be widely used. (Note this is different from the Hive query engine; this is the system catalog, where information about table schemas and their locations lives.) In AWS, the Glue catalog is also very popular.

In this tutorial, we will focus on using Presto with the Hive Metastore on your laptop.   

What is the Hive Metastore?

The Hive Metastore is the mapping between the database tables and columns and the objects or files that reside in the data lake. This could be a file system when using HDFS, or immutable objects in object stores like AWS S3. This document simplifies the process for a laptop scenario to get you started. For real production workloads, Ahana Cloud, which provides Presto as a managed service with a Hive Metastore, is a good choice if you are looking for an easy and performant solution for SQL on AWS S3.

Implementation steps

Step 1

Create a docker network namespace so that the containers can communicate with each other over it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahana has developed a sandbox for PrestoDB that can be downloaded from Docker Hub. Use the command below to pull the PrestoDB sandbox, which comes with all the packages needed to run PrestoDB.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:  

Start an instance of the PrestoDB sandbox and name it coordinator:

#docker run -d -p 8080:8080 -it --net presto_network --name coordinator
ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

We only want the coordinator to run in this container, without a worker node, so let's edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Step 5:

Restart the docker container running the coordinator, since we updated the config file to run this instance only as a Presto coordinator with the worker service stopped.

# docker restart coordinator

Step 6:

Create three more containers using the ahanaio/prestodb-sandbox image:

user@presto:~$docker run -d -p 8081:8081 -it --net presto_network --name worker1  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8082:8082 -it --net presto_network --name worker2  ahanaio/prestodb-sandbox

user@presto:~$docker run -d -p 8083:8083 -it --net presto_network --name worker3  ahanaio/prestodb-sandbox

Step 7:

Edit the etc/config.properties file in each of the three worker containers: set coordinator to false, set http-server.http.port to 8081, 8082, and 8083 respectively, and point discovery.uri at the coordinator.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080

Step 8:

Now we will install and configure Hive on the coordinator container.

Install wget, procps, tar, and less:

# yum install -y wget procps tar less

Step 9:

Download and install the Hive and Hadoop packages, and set the HOME and PATH variables for Java, Hive, and Hadoop:

#HIVE_BIN=https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
#HADOOP_BIN=https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz


#wget --quiet ${HIVE_BIN}
#wget --quiet ${HADOOP_BIN}


#tar -xf apache-hive-3.1.2-bin.tar.gz -C /opt
#tar -xf hadoop-3.3.1.tar.gz -C /opt
#mv /opt/apache-hive-3.1.2-bin /opt/hive
#mv /opt/hadoop-3.3.1 /opt/hadoop


#export JAVA_HOME=/usr
#export HIVE_HOME=/opt/hive
#export HADOOP_HOME=/opt/hadoop
#export PATH=$PATH:${HADOOP_HOME}:${HADOOP_HOME}/bin:$HIVE_HOME:/bin:.
#cd /opt/hive

Step 10:

Download additional jars needed to run with S3

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.10.6/aws-java-sdk-core-1.10.6.jar

#wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar

#wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.4/hadoop-aws-2.8.4.jar

#cp aws-java-sdk-core-1.10.6.jar /opt/hadoop/share/hadoop/tools/lib/
#cp aws-java-sdk-s3-1.10.6.jar  /opt/hadoop/share/hadoop/tools/lib/
#cp hadoop-aws-2.8.4.jar  /opt/hadoop/share/hadoop/tools/lib/

echo "export HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar" >> /opt/hive/conf/hive-env.sh

Step 11:

Configure and start hive

cp /opt/hive/conf/hive-default.xml.template /opt/hive/conf/hive-site.xml
mkdir -p /opt/hive/hcatalog/var/log
bin/schematool -dbType derby -initSchema
bin/hcatalog/sbin/hcat_server.sh start

Step 12:

Create the /opt/presto-server/etc/catalog/hive.properties file to add the Hive endpoint to Presto; this file needs to be added on both the coordinator and worker containers.

If you choose to validate using an AWS S3 bucket, provide the security credentials for it.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Key>
hive.s3.aws-secret-key=<your AWS Secret Key>

Step 13:

Restart the coordinator and all worker containers

#docker restart coordinator
#docker restart worker1
#docker restart worker2
#docker restart worker3

Step 14:

Run the presto-cli and use hive as the catalog:

bash-4.2# presto-cli --server localhost:8080 --catalog hive

Step 15:

Create a schema using a local or S3 location.

presto:default> create schema tpch with (location='file:///root');
CREATE SCHEMA
presto:default> use tpch;

If you have access to an S3 bucket, use the following create command with S3 as the destination:

presto:default> create schema tpch with (location='s3a://bucket_name');
CREATE SCHEMA
presto:default> use tpch;

Step 16:

Hive has the option to create two types of tables:

  • Managed tables
  • External tables

Managed tables are tightly coupled with the data at the destination: if you delete a table, the associated data is also deleted.

External tables are loosely coupled with the data: they maintain a pointer to it, so deleting the table does not delete the data at the external location.

Transactional (ACID) semantics are only supported on managed tables.

We will create a managed table under the hive.tpch schema:

presto:tpch> create table hive.tpch.lineitem with (format='PARQUET') AS SELECT * FROM tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows
Query 20210921_051649_00015_uvkq7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1:48 [6M rows, 0B] [55.4K rows/s, 0B/s]

Step 17:

Run desc on the new table to see its columns:

presto> desc hive.tpch.lineitem     
-> ;    
Column     |    Type     | Extra | Comment
---------------+-------------+-------+--------- 
orderkey      | bigint      |       | 
partkey       | bigint      |       | 
suppkey       | bigint      |       | 
linenumber    | integer     |       | 
quantity      | double      |       | 
extendedprice | double      |       | 
discount      | double      |       | 
tax           | double      |       | 
returnflag    | varchar(1)  |       | 
linestatus    | varchar(1)  |       | 
shipdate      | date        |       | 
commitdate    | date        |       | 
receiptdate   | date        |       | 
shipinstruct  | varchar(25) |       | 
shipmode      | varchar(10) |       | 
comment       | varchar(44) |       |
(16 rows)
Query 20210922_224518_00002_mfm8x, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:08 [16 rows, 1.04KB] [1 rows/s, 129B/s]

Summary

In this tutorial, we provide steps to use Presto with Hive Metastore as a catalog on a laptop. Additionally AWS Glue can also be used as a catalog for prestodb. If you’re looking to get started easily with Presto and a pre-configured Hive Metastore, check out Ahana Cloud, a managed service for Presto on AWS that provides both Hive Metastore and AWS Glue as a choice of catalog for prestodb.

Connecting to Presto with Superset

Presto with Superset

This blog post will provide you with an understanding of how to connect Superset to Presto.

TL;DR

Superset refers to a connection to a distinct data source as a database. A single Presto cluster can connect to multiple data sources by configuring a Presto catalog for each desired data source. Hence, to make a Superset database connection to a particular data source through Presto, you must specify the Presto cluster and catalog in the SQLAlchemy URI as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>.

Superset and SQLAlchemy

Superset is built as a Python Flask web application and leverages SQLAlchemy, a Python SQL toolkit, to provide a consistent abstraction layer to relational data sources. Superset uses a consistent SQLAlchemy URI as a connection string for a defined Superset database. The schema for the URI is as follows: dialect+driver://username:password@host:port/database. We will deconstruct the dialect, driver, and database in the following sections.

SQLAlchemy defines a dialect as the system it uses to communicate with the specifics of various databases (e.g., the flavor of SQL) and DB-APIs, the low-level Python APIs used to talk to specific relational data sources. A Python DB-API database driver is required for a given data source. For example, PyHive is a DB-API driver used to connect to Presto. It is possible for a single dialect to choose between multiple DB-API drivers. For example, the PostgreSQL dialect supports the following DB-API drivers: psycopg2, pg8000, psycopg2cffi, and pygresql. Typically, a single DB-API driver is set as the default for a dialect and is used when no explicit DB-API is specified. For PostgreSQL, the default DB-API driver is psycopg2.
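To make the dialect/driver split concrete, here is a small sketch using only the Python standard library (the credentials and host in the example URIs are made up):

```python
# Split the scheme of a SQLAlchemy-style URI into dialect and optional driver.
from urllib.parse import urlsplit

def dialect_and_driver(uri: str):
    scheme = urlsplit(uri).scheme        # e.g. "postgresql+psycopg2"
    dialect, _, driver = scheme.partition("+")
    return dialect, driver or None       # None means "use the dialect's default driver"

print(dialect_and_driver("postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase"))
print(dialect_and_driver("postgresql://sonny:foobar@pghost:5432/mydatabase"))
```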

The term database can be confusing since it is heavily overloaded. In a typical scenario, a given data source, such as PostgreSQL, has multiple logical groupings of tables, which are called "databases". In a way, these "databases" provide namespaces for tables; identically named tables can exist in two different "databases" without collision. As an example, we can use the PostgreSQL instance available when locally installing Superset with Docker Compose.

In this instance of PostgreSQL, we have four databases: postgres, superset, template0, and template1.

superset@localhost:superset> \l

+-----------+----------+----------+------------+------------+-----------------------+
| Name      | Owner    | Encoding | Collate    | Ctype      | Access privileges     |
|-----------+----------+----------+------------+------------+-----------------------|
| postgres  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| superset  | superset | UTF8     | en_US.utf8 | en_US.utf8 | <null>                |
| template0 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
| template1 | superset | UTF8     | en_US.utf8 | en_US.utf8 | =c/superset           |
|           |          |          |            |            | superset=CTc/superset |
+-----------+----------+----------+------------+------------+-----------------------+

We can look into the superset database and see the tables in it. The key thing to remember here is that ultimately a Superset database needs to resolve to a collection of tables, whatever that collection is called in a particular dialect.

superset@localhost:superset> \c superset

You are now connected to database "superset" as user "superset"

+--------+----------------------------+-------+----------+
| Schema | Name                       | Type  | Owner    |
|--------+----------------------------+-------+----------|
| public | Clean                      | table | superset |
| public | FCC 2018 Survey            | table | superset |
| public | ab_permission              | table | superset |
| public | ab_permission_view         | table | superset |
| public | ab_permission_view_role    | table | superset |
| public | ab_register_user           | table | superset |
| public | ab_role                    | table | superset |
| public | ab_user                    | table | superset |
| public | ab_user_role               | table | superset |
| public | ab_view_menu               | table | superset |
| public | access_request             | table | superset |
| public | alembic_version            | table | superset |
| public | alert_logs                 | table | superset |
| public | alert_owner                | table | superset |
| public | alerts                     | table | superset |
| public | annotation                 | table | superset |
| public | annotation_layer           | table | superset |
| public | bart_lines                 | table | superset |
| public | birth_france_by_region     | table | superset |
| public | birth_names                | table | superset |
| public | cache_keys                 | table | superset |
| public | channel_members            | table | superset |
| public | channels                   | table | superset |
| public | cleaned_sales_data         | table | superset |
| public | clusters                   | table | superset |
| public | columns                    | table | superset |
| public | covid_vaccines             | table | superset |
:

With an understanding of dialects, drivers, and databases under our belt, let's solidify it with a few examples. Assume we want to create a Superset database connection to a PostgreSQL data source, and in particular to a PostgreSQL database named mydatabase. Our PostgreSQL data source is hosted at pghost on port 5432, and we will log in as sonny (password foobar). Here are three SQLAlchemy URIs we could use (inspired by the SQLAlchemy documentation):

  1. postgresql+psycopg2://sonny:foobar@pghost:5432/mydatabase We explicitly specify the postgresql dialect and psycopg2 driver.
  2. postgresql+pg8000://sonny:foobar@pghost:5432/mydatabase We use the pg8000 driver.
  3. postgresql://sonny:foobar@pghost:5432/mydatabase We do not explicitly list any driver, and hence, SQLAlchemy will use the default driver, which is psycopg2 for postgresql.

Superset lists its recommended Python packages for database drivers in the public documentation.

Presto Catalogs

Because Presto can connect to multiple data sources, when connecting to Presto as a defined Superset database, it’s important to understand what you are actually making a connection to.

In Presto, the equivalent notion of a "database" (i.e. a logical collection of tables) is called a schema, and access to the schemas of a particular data source is defined in a catalog.

As an example, the listing below is the equivalent catalog configuration to connect to the mydatabase PostgreSQL database we described previously. If we were querying a table in that catalog directly from Presto, a fully-qualified table would be specified as catalog.schema.table (e.g. select * from catalog.schema.table). Hence, querying the Clean table (which lives in PostgreSQL's public schema) would be select * from postgresql.public."Clean".

connector.name=postgresql
connection-url=jdbc:postgresql://pghost:5432/mydatabase
connection-user=sonny
connection-password=foobar

Superset to Presto

Going back to Superset: to create a Superset database that connects to Presto, we specify the presto dialect. However, because Presto is the intermediary to an underlying data source, such as PostgreSQL, the username and password we provide (and authenticate against) are the Presto username and password. Further, we must specify a Presto catalog for the database in the SQLAlchemy URI. From there, Presto, through its catalog configuration, authenticates to the backing data source with the appropriate credentials (e.g. sonny and foobar). Hence, the SQLAlchemy URI to connect to Presto in Superset is as follows: presto://<presto-username>:<presto-password>@<presto-coordinator-url>:<http-server-port>/<catalog>

The http-server-port refers to the http-server.http.port configuration on the coordinator and workers (see Presto config properties); it is usually set to 8080.
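As a sketch with made-up credentials and catalog name, the pieces of that URI map directly onto standard URL components:

```python
# Pull apart a Superset-style Presto URI; every value below is hypothetical.
from urllib.parse import urlsplit

uri = "presto://alice:secret@presto.example.com:8080/postgresql"
parts = urlsplit(uri)

print(parts.username)          # alice (the Presto username, not the data source's)
print(parts.hostname)          # presto.example.com (the coordinator)
print(parts.port)              # 8080 (http-server.http.port)
print(parts.path.lstrip("/"))  # postgresql (the Presto catalog)
```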

New Superset Database Connection UI

In Superset 1.3, there is a feature-flagged version of a new database connection UI that simplifies connecting to data without constructing the SQLAlchemy URI. The new database connection UI can be turned on in config.py with FORCE_DATABASE_CONNECTIONS_SSL = True (PR #14934). The new UI can also be viewed in the Superset documentation.

Try It Out!

In less than 30 minutes, you can get up and running using Superset with a Presto cluster with Ahana Cloud for Presto. Ahana Cloud for Presto is an easy-to-use fully managed Presto service that also automatically stands up a Superset instance for you. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Presto Tutorial 103: PrestoDB cluster on GCP

Introduction

This tutorial is Part III of our Getting Started with PrestoDB series. As a reminder, PrestoDB is an open source distributed SQL query engine. In tutorial 102 we covered how to run a three-node PrestoDB cluster on a laptop. In this tutorial, we'll show you how to run a PrestoDB cluster in a GCP environment using VM instances and GKE containers.

Environment

This guide was developed on GCP VM instances and GKE containers.

Presto on GCP with VMs

Implementation steps for prestodb on vm instances

Step 1: Create a GCP VM instance using the CREATE INSTANCE tab and name it presto-coordinator. Next, create three more VM instances named presto-worker1, presto-worker2, and presto-worker3.

Step 2: By default, GCP blocks all network ports, so PrestoDB needs ports 8080-8083 enabled. Use the firewall rules tab to enable them.

Step 3: 

Install Java and Python.

Step 4:

Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.235.1, which we will call the installation directory.

Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io

user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  721M  100  721M    0     0   245M      0  0:00:02  0:00:02 --:--:--  245M
user@presto-coordinator-1:~$ curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100 12.7M  100 12.7M    0     0  15.2M      0 --:--:-- --:--:-- --:--:-- 15.1M
user@presto-coordinator-1:~$

Step 5:

Use gunzip and tar to unpack the presto-server tarball:

user@presto-coordinator-1:~$ gunzip presto-server-0.235.1.tar.gz ; tar -xf presto-server-0.235.1.tar

Step 6: (optional)

Rename the directory without version number

user@presto-coordinator-1:~$ mv presto-server-0.235.1 presto-server

Step 7:  

Create etc, etc/catalog and data directories

user@presto-coordinator-1:~/presto-server$ mkdir etc etc/catalog data

Step 8:

Define the etc/node.properties, etc/config.properties, etc/jvm.config, etc/log.properties, and etc/catalog/jmx.properties files as shown below for the Presto coordinator server.

user@presto-coordinator-1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/home/user/presto-server/data

user@presto-coordinator-1:~/presto-server$ cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

user@presto-coordinator-1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-coordinator-1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-coordinator-1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx

Step 9:

Check the cluster UI status. It should show an active worker count of 0, since only the coordinator has been enabled.
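If you prefer the command line to the browser UI, the coordinator's standard /v1/info REST endpoint can be polled instead; a minimal sketch, assuming the coordinator listens on port 8080 (replace localhost with the coordinator's address when checking remotely):

```shell
# Poll the coordinator's /v1/info endpoint to confirm it is up.
check_presto() {
  if curl -sf "http://$1:8080/v1/info" >/dev/null; then
    echo "coordinator up"
  else
    echo "coordinator unreachable"
  fi
}

check_presto localhost
```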

Step 10: 

Repeat steps 1 to 8 on the three remaining VM instances, which will act as worker nodes.

On the configuration step for worker nodes, set coordinator to false and http-server.http.port to 8081, 8082 and 8083 for worker1, worker2 and worker3 respectively.

Also make sure node.id and http-server.http.port are different for each worker node.

user@presto-worker1:~/presto-server$ cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffffd
node.data-dir=/home/user/presto-server/data
user@presto-worker1:~/presto-server$ cat etc/config.properties
coordinator=false
http-server.http.port=8083
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://presto-coordinator-1:8080

user@presto-worker1:~/presto-server$ cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

user@presto-worker1:~/presto-server$ cat etc/log.properties
com.facebook.presto=INFO

user@presto-worker1:~/presto-server$ cat etc/catalog/jmx.properties
connector.name=jmx
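The per-worker property edits above can be scripted; a minimal sketch, assuming the directory layout from steps 5-8 and the coordinator host name from step 1, with worker N getting HTTP port 8080+N to match the scheme described earlier:

```shell
# Generate etc/config.properties for a worker, given its index.
write_worker_config() {
  idx="$1"; dir="$2"
  mkdir -p "$dir/etc"
  cat > "$dir/etc/config.properties" <<EOF
coordinator=false
http-server.http.port=$((8080 + idx))
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://presto-coordinator-1:8080
EOF
}

write_worker_config 1 /tmp/presto-worker1   # writes port 8081
```

Remember that node.id in etc/node.properties must still be unique per node.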

Step 11: 

Check the cluster status; it should now reflect the three worker nodes as part of the PrestoDB cluster.

Step 12:

Verify the PrestoDB environment by running the Presto CLI with a simple JMX query:

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator


Implementation steps for PrestoDB on GKE containers

Step 1:

Go to the Google Cloud Console and activate the Cloud Shell window.

Step 2:

Create an artifact repository using the command below, replacing REGION with the region where you want to create the repository.

gcloud artifacts repositories create ahana-prestodb \
   --repository-format=docker \
   --location=REGION \
   --description="Docker repository"

Step 3:

Create the container cluster by using the gcloud command: 

user@cloudshell:~ (weighty-list-324021)$ gcloud config set compute/zone us-central1-c
Updated property [compute/zone].

user@cloudshell:~ (weighty-list-324021)$ gcloud container clusters create prestodb-cluster01

Creating cluster prestodb-cluster01 in us-central1-c…done.
Created 
.
.
.

kubeconfig entry generated for prestodb-cluster01.
NAME                LOCATION       MASTER_VERSION   MASTER_IP     MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
prestodb-cluster01  us-central1-c  1.20.8-gke.2100  34.72.76.205  e2-medium     1.20.8-gke.2100  3          RUNNING
user@cloudshell:~ (weighty-list-324021)$

Step 4:

After container cluster creation, run the following command to see the cluster’s three nodes

user@cloudshell:~ (weighty-list-324021)$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gke-prestodb-cluster01-default-pool-34d21367-25cw   Ready    <none>   7m54s   v1.20.8-gke.2100
gke-prestodb-cluster01-default-pool-34d21367-7w90   Ready    <none>   7m54s   v1.20.8-gke.2100
gke-prestodb-cluster01-default-pool-34d21367-mwrn   Ready    <none>   7m53s   v1.20.8-gke.2100
user@cloudshell:~ (weighty-list-324021)$

Step 5:

Pull the PrestoDB Docker image:

user@cloudshell:~ (weighty-list-324021)$ docker pull ahanaio/prestodb-sandbox

Step 6:

Run ahanaio/prestodb-sandbox locally in the shell to create a container named coordinator, which will later be committed as an image and deployed on the container cluster.

user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name coordinator ahanaio/prestodb-sandbox
391aa2201e4602105f319a2be7d34f98ed4a562467e83231913897a14c873fd0

Step 7:

Edit the etc/config.properties file inside the container and set the node-scheduler.include-coordinator property to false. Then restart the coordinator.

user@cloudshell:~ (weighty-list-324021)$ docker exec -i -t coordinator bash                                                                                                                       
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker restart coordinator
coordinator

Step 8:

Now run docker commit and tag the resulting image ID as coordinator; this creates a new local image named coordinator.

user@cloudshell:~ (weighty-list-324021)$ docker commit coordinator
sha256:46ab5129fe8a430f7c6f42e43db5e56ccdf775b48df9228440ba2a0b9a68174c

user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED          SIZE
<none>                     <none>    46ab5129fe8a   15 seconds ago   1.81GB
ahanaio/prestodb-sandbox   latest    76919cf0f33a   34 hours ago     1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 46ab5129fe8a coordinator

user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED              SIZE
coordinator                latest    46ab5129fe8a   About a minute ago   1.81GB
ahanaio/prestodb-sandbox   latest    76919cf0f33a   34 hours ago         1.81GB

Step 9:

Tag the image with the artifact repository path and push it to the artifact location:

user@cloudshell:~ docker tag coordinator:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1

user@cloudshell:~ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1

Step 10:

Deploy the coordinator into the cloud container using the below kubectl commands.

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment coordinator --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/coord:v1
deployment.apps/coordinator created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment coordinator --name=presto-coordinator --type=LoadBalancer --port 8080 --target-port 8080
service/presto-coordinator exposed

user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)          AGE
kubernetes           ClusterIP      10.7.240.1    <none>          443/TCP          41m
presto-coordinator   LoadBalancer   10.7.248.10   35.239.88.127   8080:30096/TCP   92s

Step 11:

Open the external IP in a browser and check the status.

Step 12:

Now to deploy worker1 into the GKE container, again start a local instance named worker1 using the docker run command.

user@cloudshell:~ docker run -d -p 8080:8080 -it --name worker1 coordinator
1d30cf4094eba477ab40d84ae64729e14de992ac1fa1e5a66e35ae553964b44b
user@cloudshell:~

Step 13:

Edit config.properties inside the worker1 container to set coordinator to false and http-server.http.port to 8081. The discovery.uri should also point to the coordinator running inside the GKE cluster.

user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker1  bash                                                                                                                             
bash-4.2# vi etc/config.properties
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://presto-coordinator01:8080

Step 14:

Stop the local worker1 container, commit it as an image, and tag it as worker1:

user@cloudshell:~ (weighty-list-324021)$ docker stop worker1
worker1
user@cloudshell:~ (weighty-list-324021)$ docker commit worker1
sha256:cf62091eb03702af9bc05860dc2c58644fce49ceb6a929eb6c558cfe3e7d9abf
ram@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED         SIZE
<none>                                                                <none>    cf62091eb037   6 seconds ago   1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag cf62091eb037 worker1:latest
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED         SIZE
worker1                                                               latest    cf62091eb037   2 minutes ago   1.81GB

Step 15:

Push the worker1 image to the Google artifact location:

user@cloudshell:~ (weighty-list-324021)$ docker tag worker1:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1]
b12c3306c4a9: Pushed
.
.
v1: digest: sha256:fe7db4aa7c9ee04634e079667828577ec4d2681d5ac0febef3ab60984eaff3e0 size: 2201

Step 16:

Deploy and expose worker1 from the artifact location into the Google Cloud container using these kubectl commands.

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker01 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker1:v1
deployment.apps/presto-worker01 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker01 --name=presto-worker01 --type=LoadBalancer --port 8081 --target-port 8081
service/presto-worker01 exposed

Step 17:

Check the Presto UI to confirm worker1 deployed successfully.

Step 18:

Repeat steps 12 to 17 to deploy worker2 inside a GKE container:

  • Deploy a local Ahana instance using Docker and name it worker2.
  • Edit etc/config.properties inside the worker2 container to set coordinator to false, the port to 8082, and discovery.uri to the coordinator container name.
  • Stop the instance, commit it, and create a Docker image named worker2.
  • Push the worker2 image to the Google artifact location.
  • Use kubectl commands to deploy and expose the worker2 instance inside a Google container.
  • Check the PrestoDB UI to confirm the second worker is active.
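Since the same sequence repeats per worker, it can also be scripted. Below is a dry-run sketch that prints the commands for review rather than executing them; the repository path matches the one used earlier in this tutorial, and the image ID placeholder must come from the output of docker commit:

```shell
# Dry-run of the repeated worker rollout; commands are printed,
# not executed. REPO matches the artifact path used above.
REPO=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana
for i in 2 3; do
  port=$((8080 + i))
  echo "docker run -d -it --name worker$i worker1"
  echo "# edit etc/config.properties: coordinator=false, port $port"
  echo "docker commit worker$i           # note the returned image id"
  echo "docker tag <image-id> $REPO/worker$i:v1"
  echo "docker push $REPO/worker$i:v1"
  echo "kubectl create deployment presto-worker0$i --image=$REPO/worker$i:v1"
  echo "kubectl expose deployment presto-worker0$i --name=presto-worker0$i --type=LoadBalancer --port $port --target-port $port"
done
```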
user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker2 worker1
32ace8d22688901c9fa7b406fe94dc409eaf3abfd97229ab3df69ffaac00185d
user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker2 bash
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8082
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
exit
user@cloudshell:~ (weighty-list-324021)$ docker commit worker2
sha256:08c0322959537c74f91a6ccbdf78d0876f66df21872ff7b82217693dc3d4ca1e
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                              TAG       IMAGE ID       CREATED          SIZE
<none>                                                                  <none>    08c032295953   11 seconds ago   1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 08c032295953 worker2:latest

user@cloudshell:~ (weighty-list-324021)$ docker tag worker2:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2]
aae10636ecc3: Pushed
.
.
v1: digest: sha256:103c3fb05004d2ae46e9f6feee87644cb681a23e7cb1cbcf067616fb1c50cf9e size: 2410

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker02 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker2:v1
deployment.apps/presto-worker02 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker02 --name=presto-worker02 --type=LoadBalancer --port 8082 --target-port 8082
service/presto-worker02 exposed

user@cloudshell:~ (weighty-list-324021)$ kubectl get service
NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)          AGE
kubernetes             ClusterIP      10.7.240.1     <none>           443/TCP          3h35m
presto-coordinator01   LoadBalancer   10.7.241.37    130.211.208.47   8080:32413/TCP   49m
presto-worker01        LoadBalancer   10.7.255.27    34.132.29.202    8081:31224/TCP   9m15s
presto-worker02        LoadBalancer   10.7.254.137   35.239.88.127    8082:31020/TCP   39s

Step 19:

Repeat steps 12 to 18 to provision worker3 inside the Google Cloud container:

user@cloudshell:~ (weighty-list-324021)$ docker run -d -p 8080:8080 -it --name worker3 worker1
6d78e9db0c72f2a112049a677d426b7fa8640e8c1d3aa408a17321bb9353c545

user@cloudshell:~ (weighty-list-324021)$ docker exec -it worker3 bash                                                                                                                              
bash-4.2# vi etc/config.properties
bash-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8083
discovery.uri=http://presto-coordinator01:8080
bash-4.2# exit
exit

user@cloudshell:~ (weighty-list-324021)$ docker commit worker3
sha256:689f39b35b03426efde0d53c16909083a2649c7722db3dabb57ff0c854334c06
user@cloudshell:~ (weighty-list-324021)$ docker images
REPOSITORY                                                              TAG       IMAGE ID       CREATED          SIZE
<none>                                                                  <none>    689f39b35b03   25 seconds ago   1.81GB
ahanaio/prestodb-sandbox                                                latest    76919cf0f33a   37 hours ago     1.81GB

user@cloudshell:~ (weighty-list-324021)$ docker tag 689f39b35b03 worker3:latest

user@cloudshell:~ (weighty-list-324021)$ docker tag worker3:latest us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1

user@cloudshell:~ (weighty-list-324021)$ docker push us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1
The push refers to repository [us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3]
b887f13ace4e: Pushed
.
.
v1: digest: sha256:056a379b00b0d43a0a5877ccf49f690d5f945c0512ca51e61222bd537336491b size: 2410

user@cloudshell:~ (weighty-list-324021)$ kubectl create deployment presto-worker03 --image=us-central1-docker.pkg.dev/weighty-list-324021/prestodb-ahana/worker3:v1
deployment.apps/presto-worker03 created

user@cloudshell:~ (weighty-list-324021)$ kubectl expose deployment presto-worker03 --name=presto-worker03 --type=LoadBalancer --port 8083 --target-port 8083
service/presto-worker03 exposed



Step 20:

Verify the PrestoDB environment by running the Presto CLI with a simple JMX query:

user@presto-coordinator-1:~/presto-server$ ./presto-cli
presto> SHOW TABLES FROM jmx.current;
                                                              Table                                                              
-----------------------------------------------------------------------------------------------------------------------------------
com.facebook.airlift.discovery.client:name=announcer                                                                             
com.facebook.airlift.discovery.client:name=serviceinventory                                                                      
com.facebook.airlift.discovery.store:name=dynamic,type=distributedstore                                                          
com.facebook.airlift.discovery.store:name=dynamic,type=httpremotestore                                                           
com.facebook.airlift.discovery.store:name=dynamic,type=replicator

Summary

In this tutorial you learned how to provision and run PrestoDB on Google VM instances and on GKE containers. Now you should be able to validate the functional aspects of PrestoDB.

If you want to run production Presto workloads at scale and performance, check out https://www.ahana.io, which provides a managed service for Presto.

Announcing the workload profile feature in Ahana Cloud

Ahana Cloud for Presto is the first fully integrated, cloud-native managed service that simplifies the work of cloud and data platform teams. With the managed Presto service, we provide many tuned configurations out of the box for Ahana customers.

We’re excited to announce that the workload profile feature is now available on Ahana Cloud. With this release, users can create a cluster with a validated set of configurations that suits the type of workloads or queries users plan to run.

Today, Presto clusters are configured with default properties that work well for generic workloads. However, workload-specific tuning and resource allocation require a good understanding of Presto's resource consumption for that workload. Further, changing or adding any property means first updating the configuration file and then restarting the cluster. As a result, data platform users spend days, and in many cases weeks, iterating, evaluating, and experimenting with config tuning to reach the ideal configuration for their workloads. To solve this pain point and deliver predictable performance at scale, Ahana Cloud lets users select a tuned set of properties for their workloads with one click.

Here is a short demo of creating a Presto cluster with a workload profile:

Concurrent queries are simply the number of queries executing at the same time in a given cluster. To simplify this experience we have classified workloads based on the number of concurrent queries and curated a set of tuned session properties for each profile.

Low concurrency is useful for clusters that run a limited number of queries or a few large, complex queries. It also supports bigger and heavier ETL jobs.

High concurrency is better for running multiple queries at the same time. For example, dashboard and reporting queries or A/B testing analytics, etc.

This setting can be changed after the cluster has been created, and a cluster restart is not required; the change, however, only applies to new queries. The following short demo shows how to change these profiles on running clusters.

This feature is the beginning of auto-tune capabilities for workload management. We are continuously improving Ahana Cloud to deliver a seamless, easy experience for data teams looking to leverage the power of Presto. Please give it a try: log in to the Ahana Cloud console to get started. We also have a free trial that you can sign up for today.

Tutorial: How to run SQL queries with Presto on Amazon Redshift

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial is about how to run SQL queries with Presto (running with Kubernetes) on AWS Redshift.

Presto’s Redshift connector allows querying the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters. 

Step 1: Set up a Presto cluster with Kubernetes

Set up your own Presto cluster on Kubernetes using these instructions, or use Ahana's managed service for Presto.

Step 2: Set up an Amazon Redshift cluster

Create an Amazon Redshift cluster from the AWS Console and make sure it's up and running with the dataset and tables as described here.

The screen below shows the Amazon Redshift cluster "redshift-presto-demo".

Further, the JDBC URL from the cluster is required to set up the Redshift connector with Presto.

You can skip this section if you want to use your existing Redshift cluster; just make sure your Redshift cluster is accessible from Presto, because AWS services are secure by default. Even if you created your Amazon Redshift cluster in a public VPC, the security group assigned to it can prevent inbound connections to the database cluster. In simple terms, the security group settings of the Redshift database act as a firewall and can block inbound database connections over port 5439. Find the assigned security group and check its inbound rules.
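If the inbound rule is missing, it can be added with the AWS CLI. A sketch with placeholder values follows (the security group ID and CIDR are assumptions for your environment); the command is printed for review rather than executed here:

```shell
# Open inbound TCP 5439 (Redshift) to the Presto cluster's network.
# <sg-id> and the CIDR below are placeholders.
SG_CMD='aws ec2 authorize-security-group-ingress \
  --group-id <sg-id> --protocol tcp --port 5439 \
  --cidr 10.0.0.0/16'
echo "$SG_CMD"
```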

If your Presto Compute Plane VPC and data sources are in a different VPC then you need to configure a VPC peering connection.

Step 3: Configure Presto Catalog for Amazon Redshift Connector

At Ahana we have simplified this experience and you can do this step in a few minutes as explained in these instructions.

Essentially, to configure the Redshift connector, create a catalog properties file in etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

connection-password=secret
connector.name=redshift
connection-url=jdbc:postgresql://example.net:5439/database
connection-user=root

This is what my catalog properties look like:

  my_redshift.properties: |
      connector.name=redshift
      connection-user=awsuser
      connection-password=admin1234
      connection-url=jdbc:postgresql://redshift-presto-demo.us.redshift.amazonaws.com:5439/dev

Step 4: Check the available schemas and tables, and run SQL queries with the Presto client to access the Redshift database

After successfully connecting to Amazon Redshift, you can connect to the Presto CLI, make sure the Redshift catalog is picked up, and run show schemas and show tables to see the available data.

$./presto-cli.jar --server https://<presto.cluster.url> --catalog my_redshift --schema <schema_name> --user <presto_username> --password

In the example below, you can see that a new catalog for the Redshift database, called "my_redshift", has been initialized.

presto> show catalogs;
   Catalog   
-------------
 ahana_hive  
 jmx         
 my_redshift 
 system      
 tpcds       
 tpch        
(6 rows)
 
Query 20210810_173543_00209_krtkp, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Next, you can check all the schemas available from your Amazon Redshift for Presto to work with.

presto> show schemas from my_redshift;
       Schema       
--------------------
 catalog_history    
 information_schema 
 pg_catalog         
 pg_internal        
 public             
(5 rows)
 
Query 20210810_174048_00210_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:01 [5 rows, 85B] [4 rows/s, 72B/s]

Here, I have used the sample data that comes with the Redshift cluster setup. I have chosen the schema "public", which is part of the "dev" Redshift database.

presto> show tables from my_redshift.public;
  Table   
----------
 category 
 date     
 event    
 listing  
 sales    
 users    
 venue    
(7 rows)
 
Query 20210810_185448_00211_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:03 [7 rows, 151B] [2 rows/s, 56B/s]

Further, you can explore tables, such as "sales" in the example below.

presto> select * from my_redshift.public.sales LIMIT 2;
 salesid | listid | sellerid | buyerid | eventid | dateid | qtysold | pricepaid | commission |        saletime         
---------+--------+----------+---------+---------+--------+---------+-----------+------------+-------------------------
   33095 |  36572 |    30047 |     660 |    2903 |   1827 |       2 | 234.00    | 35.10      | 2008-01-01 01:41:06.000 
   88268 | 100813 |    45818 |     698 |    8649 |   1827 |       4 | 836.00    | 125.40     | 2007-12-31 23:26:20.000 
(2 rows)
 
Query 20210810_185527_00212_krtkp, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:03 [18.1K rows, 0B] [6.58K rows/s, 0B/s]

Following are some more complex queries you can run against sample data:

presto:public> -- Find top 10 buyers by quantity
            ->SELECT firstname, lastname, total_quantity 
            -> FROM   (SELECT buyerid, sum(qtysold) total_quantity
            ->         FROM  sales
            ->         GROUP BY buyerid
            ->         ORDER BY total_quantity desc limit 10) Q, users
            -> WHERE Q.buyerid = userid
            -> ORDER BY Q.total_quantity desc;
 firstname | lastname | total_quantity 
-----------+----------+----------------
 Jerry     | Nichols  |             67 
 Armando   | Lopez    |             64 
 Kameko    | Bowman   |             64 
 Kellie    | Savage   |             63 
 Belle     | Foreman  |             60 
 Penelope  | Merritt  |             60 
 Kadeem    | Blair    |             60 
 Rhona     | Sweet    |             60 
 Deborah   | Barber   |             60 
 Herrod    | Sparks   |             60 
(10 rows)
 
Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes
Splits: 214 total, 214 done (100.00%)
0:10 [222K rows, 0B] [22.4K rows/s, 0B/s]
 
presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales.
            -> SELECT eventname, total_price 
            -> FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile 
            ->        FROM (SELECT eventid, sum(pricepaid) total_price
            ->              FROM   sales
            ->              GROUP BY eventid)) Q, event E
            ->        WHERE Q.eventid = E.eventid
            ->        AND percentile = 1
            -> ORDER BY total_price desc;
      eventname       | total_price 
----------------------+-------------
 Adriana Lecouvreur   | 51846.00    
 Janet Jackson        | 51049.00    
 Phantom of the Opera | 50301.00    
 The Little Mermaid   | 49956.00    
 Citizen Cope         | 49823.00    
 Sevendust            | 48020.00    
 Electra              | 47883.00    
 Mary Poppins         | 46780.00    
 Live                 | 46661.00    
(9 rows)
 
Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes
Splits: 230 total, 230 done (100.00%)
0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]

Step 5: Run SQL query to join data between different systems like Redshift and Hive

Another great use case of Presto is data federation. In this example I will join an Apache Hive table with an Amazon Redshift table and run a JOIN query that accesses both tables from Presto.

Here, I have two catalogs: "ahana_hive" for the Hive database and "my_redshift" for Amazon Redshift. They contain the tables ahana_hive.default.customer and my_redshift.public.users, respectively.

Following are simple queries that join these tables, the same way you would join two tables from the same database.

presto> show catalogs;
presto> select * from ahana_hive.default.customer;
presto> select * from my_redshift.public.users;
presto> select * from ahana_hive.default.customer x join my_redshift.public.users y on x.nationkey = y.userid;

At Ahana, we have made it very simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Announcing the Ahana $20M Series A – Furthering our Vision of Open Data Lake Analytics with Presto

I’m very excited to announce that Ahana, the SaaS for Presto company, has raised a jumbo $20M Series A round from lead investor Third Point Ventures. Our SaaS managed service for open source Presto enables data platform teams to easily and cost effectively deliver powerful SQL processing on all their data. This is our vision of Open Data Lake Analytics and it’s what Facebook, Uber, Twitter, and others in the Presto Foundation have been running at scale.

It’s been only 15 months since the founding of Ahana and I’m extremely proud of what the team has achieved to date. We came out of stealth last year with seed investments totalling $4.8M from GV (Google Ventures), Lux Capital, and Leslie Ventures, who all also participated in this oversubscribed A round. Initially we focused on the Presto community, providing support, open source contributions, and tutorials to help people get started easily with Presto. Subsequently, we announced the first managed service for Presto at PrestoCon in September and GA’d two months later at AWS Re:Invent in December with early customer acclaim. 

I am also excited to have Rob Schwartz, Managing Partner of Third Point Ventures join our board and to partner with Third Point that has tremendous public market investing expertise, relationships, and research capabilities. Third Point has over $25B under management and Rob drives their early stage, growth, and cross-over investments in their emerging technology arm, Third Point Ventures. I value his hands-on experience with young companies, helping them deliver on their vision. For example, he is an active investor in Yellowbrick in the data space. Rob and Third Point had four key reasons to invest in Ahana…the team, the project, the market, and the product-market fit as evidenced by customer traction. Let me touch on these four areas:

  1. THE TEAM: This is my 6th startup tour of duty and I’ll be the first to attest that startups require an incredible amount of energy. That energy comes from the synchronized rowing of the oars to propel the ship, no matter the size (or startup stage of company growth). I’m most proud of our extraordinary team and all their hands on deck that have pulled together so far. The Ahana team includes experts in a range of industry-leading databases: Teradata, IBM DB2, Vertica, Aster Data, and the recently IPO’d Couchbase, as well as experts in open source and Presto hailing from Facebook, Uber, Walmart, and Alibaba.
  2. THE OPEN SOURCE PRESTO PROJECT: Since we started Ahana last year, the momentum we’ve seen in the Presto community has been phenomenal. We helped the Presto Foundation lead two massively successful PrestoCons and numerous meetups across the world, and community membership has crossed 1,000 members. The Docker Sandbox Container for Presto hosted by Ahana has had over 250K pulls in just the last 6 months, and 10 companies are now part of the growing Presto Foundation consortium. We’re both thankful to all those in the community who have helped make Presto what it is and humbled to be a part of its success. We pledge to continue open source code contributions for the benefit of the Presto community for many years to come.
  3. THE MARKET: This $20M Series A raise enables us to further our vision of providing the most flexible and performant Open Data Lake Analytics with Presto. Open Data Lake Analytics is quickly becoming the next big emerging stack, led by hyperscaler bellwether companies like Facebook and Uber. While the data warehouse has been the workhorse of analytics, it’s also very expensive. To address that, more data is moving into cloud data lakes like AWS S3 because they are so inexpensive and ubiquitous. So we’re seeing the data warehouse getting deconstructed in the cloud, with the commodified data lake coupled with open source Presto. This stack enables SQL analytics directly on the data lake, making it the most cost-effective, open, and flexible solution.
  4. PRODUCT-MARKET FIT: Ahana is at the intersection of some rapidly growing trends right now: Open Source, Cloud Analytics, and the Data Lake. And our customer base is proof of that. Having only GA’d this past December, the adoption we’ve seen has been incredible. Companies like Cartona, an eCommerce company out of Egypt, and Carbon, a fast-growing ad tech company, are using Ahana and telling the world about their use cases. Securonix, the Gartner Magic Quadrant leader in the security SIEM space, is a huge proponent of Presto and Ahana and recently joined us at the AWS Startup Showcase. Our customers are building out reporting and dashboarding, customer-facing analytics, data transformation use cases, and much more with Ahana Cloud for Presto. We can’t wait to see what they do next.

Lastly, I’ll mention that this raise enables us to accelerate growth in three main areas: 

  1. Technical innovation of the Presto project, by scaling our product and engineering teams. (By the way, we’re an all-remote company.)
  2. Adoption via more evangelism for the Presto open source project. We will continue working closely with the community and other Presto Foundation members like Facebook, Uber, and Intel.
  3. Growing our marketing and sales organizations, continuing our focus on customer adoption.

To sum it up, I’d like to share what Rob Schwartz, Managing Partner of Third Point Ventures says about us:

“We’re excited to join the exceptional team at Ahana and assist them in their evolution from rapid, early adoption to substantial market prominence. As we witness the evolution of modern analytics, we’re seeing a new stack emerge adjacent to the data warehouse. Companies need an open, flexible approach to access their data, and the data lake with Presto on top provides that. Ahana Cloud provides the foundation for this new stack, giving all platform teams the ability to easily use Presto on their data lake. With Ahana, any company can leverage the open data lake for their analytics. This is extremely powerful.”

Cheers to our next phase of growth, and did I mention we’re hiring? 😉 

We are just getting started. Join us on this incredible journey at the intersection of cloud, data, and open source…what many unicorns are made of.

Autoscale your Presto cluster in Ahana Cloud

We’re excited to announce that autoscaling is now available on Ahana Cloud. In this initial release, the autoscaling feature monitors the average CPU utilization of your Presto worker nodes and scales out when it reaches the 75% threshold. Additionally, Presto clusters now have the ability to scale in to a minimum number of worker nodes when the cluster is idle for a user-specified amount of time.

Never run out of memory with autoscaling

One of the challenges of running a Presto cluster is making the right decision about the number of worker nodes required to run your queries. Not all queries are equal, and predicting how many nodes will be required is not always possible. With the scale-out feature, the number of worker nodes increases based on CPU utilization to ensure that your queries can execute without running out of memory. That way you don’t have to worry about whether your deployment can support your requirements. Future iterations will include scale-in based on CPU utilization and autoscaling based on additional metrics.
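The scale-out behavior described above boils down to a threshold check. The sketch below is purely illustrative (the function name and parameters are assumptions, not Ahana's implementation), but it captures the rule: when average worker CPU crosses 75%, add a step of workers, capped at the maximum count.

```python
def scale_out(current_workers: int, avg_cpu: float,
              max_workers: int, step_size: int,
              threshold: float = 0.75) -> int:
    """Return the new worker count after one scaling evaluation.

    Illustrative only: when average CPU utilization across workers
    reaches the threshold, add `step_size` workers, never exceeding
    `max_workers`; otherwise leave the cluster unchanged.
    """
    if avg_cpu >= threshold and current_workers < max_workers:
        return min(current_workers + step_size, max_workers)
    return current_workers
```

For example, a 3-node cluster at 82% average CPU with a step size of 2 would grow to 5 workers, while a cluster already at its maximum stays put.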

Save cost with Idle time

When no queries are sent to the Presto cluster, it makes sense to reduce the number of worker nodes, but it’s not always practical to do so manually. With the Idle time feature enabled, the system monitors query activity; if no activity is detected for a user-defined period of time, say 15 minutes, the number of worker nodes is reduced to its minimum count.

Two common use cases we found that benefit greatly from idle time cost saving are transformation workloads and ad hoc querying.

  • For transformation workloads, a query can potentially run for several hours, making it impractical to monitor activity and decide when to manually stop the cluster or reduce the number of running nodes. Idle time cost saving waits for a period of inactivity and then automatically reduces the worker node count to its minimum until the next query hits the cluster.
  • For ad hoc querying, as the name suggests, querying is not continuous, and scaling in to a minimum worker node count between queries helps reduce costs.
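In both cases the idle-time rule is the same. A minimal sketch, assuming a monitor that tracks the seconds since the last query was seen (names and defaults are illustrative, not Ahana's code; 30 minutes is the default idle window described later in this post):

```python
def scale_in_when_idle(seconds_since_last_query: float,
                       current_workers: int, min_workers: int,
                       idle_timeout_s: float = 30 * 60) -> int:
    """Return the new worker count for one idle-time evaluation.

    Illustrative only: if no query has been seen for at least
    `idle_timeout_s` seconds, drop straight to `min_workers`;
    otherwise keep the current worker count.
    """
    if seconds_since_last_query >= idle_timeout_s and current_workers > min_workers:
        return min_workers
    return current_workers
```

An 8-worker cluster idle for the full timeout scales in to its minimum, while a cluster that ran a query ten minutes ago is left alone.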

Enabling autoscaling

Getting started with autoscaling is easy with this step-by-step walkthrough.

Step 1 – In Cluster settings select Scale Out only (CPU) scaling strategy

Step 2 – Enter a Minimum and a Maximum worker node count as well as a Scale Out step size. The scale-out step size determines how many nodes are added to the cluster when scaling out is triggered.

Step 3 – By default, the cluster will resize to its minimum worker node count (defined above) after 30 minutes; this can be set between 10 minutes and 1 hour.

Your new Presto cluster will scale out, up to its maximum worker node count, as long as the average CPU utilization of the worker nodes exceeds 75%. If no queries reach the cluster for the default period of 30 minutes, the cluster will reduce its worker node count to its minimum.

Enabling Idle time cost saving

Enabling Idle time cost saving is very easy with this step-by-step walkthrough.
As shown in the section above, idle time cost saving is enabled by default in the Scale Out only (CPU) scaling strategy.

For a Static cluster, enable the feature as follows:

Step 1 – Check Scale to a single worker node when idle

Step 2 – By default, the cluster will resize to a single worker node after 30 minutes; this can be set between 10 minutes and 1 hour.

Changing the autoscaling configuration of an existing cluster

You can always change the configuration after a cluster has been created by following the steps below:

Step 1 – Navigate to the cluster details view

Step 2 – Edit the cluster scaling policy configuration

Step 3 – The cluster will update its configuration immediately after you click the Save button

What’s next?

Log in to the Ahana Cloud console to get started. You can also learn more about autoscaling by heading over to our documentation.

Tutorial: How to run SQL queries with Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries and batch workloads across multiple data sources. This tutorial shows how to run SQL queries with Presto (running on Kubernetes) on Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

Step 1: Setup a Presto cluster with Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions, or use Ahana’s managed service for Presto.

Step 2: Setup a Google BigQuery Project with Google Cloud Platform

Create a Google BigQuery project from the Google Cloud Console and make sure it’s up and running with a dataset and tables, as described here.

The screen below shows a Google BigQuery project with the table “Flights”.

Step 3: Set up a key and download Google BigQuery credential JSON file.

To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format. 

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.

A sample credential file should look like this:

{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}

Pro-Tip: Before you move to the next step, try using your downloaded credential JSON file with a third-party SQL tool like DBeaver to access your BigQuery table. This confirms that your credentials have valid access rights and helps isolate any issue with the credentials themselves.
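In addition to testing with a SQL tool, you can run a quick local sanity check on the key file's structure before wiring it into Presto. This helper is illustrative (it is not part of any Google or Presto API, and the field list is based on the sample file above); an empty result only means the file has the expected shape, not that the key is valid or has the right IAM permissions.

```python
import json

# Fields a service-account key file is expected to contain,
# based on the sample credential file shown above.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "token_uri",
}

def check_service_account_key(path: str) -> list:
    """Return a list of structural problems found in a key file."""
    with open(path) as f:
        key = json.load(f)
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - key.keys())]
    if key.get("type") != "service_account":
        problems.append("type is not 'service_account'")
    if not key.get("private_key", "").startswith("-----BEGIN PRIVATE KEY-----"):
        problems.append("private_key does not look like a PEM block")
    return problems
```

Run it against your downloaded file; a non-empty list points at a truncated or hand-edited download before you spend time debugging the connector.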

Step 4: Configure Presto Catalog for Google BigQuery Connector

To configure the BigQuery connector, you need to create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. You can create the file with the following contents, replacing the connection properties as appropriate for your setup. This should be done by editing the config map to make sure it’s reflected in the deployment:

kubectl edit configmap presto-catalog -n <cluster_name> -o yaml

Following are the catalog properties that need to be added:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=/path/to/bigquery-credentials.json

Following are the sample entries for catalog yaml file:

bigquery.properties: |
  connector.name=bigquery
  bigquery.project-id=poised-journey-317806
  bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json

Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file

To configure the BigQuery connector,

  1. Load the content of the credential file as bigquery-credential.json in the Presto coordinator’s configmap:

kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml

  2. Add a new volumeMounts section for the credential file in the coordinator’s deployment file:

    kubectl edit deployment presto-coordinator -n <cluster_name> 

Below is a sample configuration that you can append to the end of the volumeMounts section in your coordinator’s deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-coordinator-etc-vol
  subPath: bigquery-credential.json
  3. Load the content of the credential file as bigquery-credential.json in the Presto worker configmap:

kubectl edit configmap presto-worker-etc -n <cluster_name>  -o yaml

  4. Add a new volumeMounts section for the credential file in the worker’s deployment file:

kubectl edit deployment presto-worker -n <cluster_name> 

Below is a sample configuration that you can append to the end of the volumeMounts section in your worker’s deployment file:

volumeMounts:
- mountPath: /opt/presto-server/etc/bigquery-credential.json
  name: presto-worker-etc-vol
  subPath: bigquery-credential.json

Step 6: Setup database connection with Apache Superset

Create your database connection URL to query from Superset with the following syntax:

presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>
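Because the username and password are embedded directly in the URL, any special characters in them must be percent-encoded. A small illustrative helper (the function is an assumption for this walkthrough, not a Superset API; the host below is the one used in this tutorial) that builds the connection URL safely:

```python
from urllib.parse import quote

def presto_superset_uri(user: str, password: str, host: str,
                        catalog: str, port: int = 443) -> str:
    """Build a Superset-style Presto connection URI, percent-encoding
    the credentials so characters like '@' don't break the URL."""
    return (f"presto://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{catalog}")
```

For instance, a password containing `@` is encoded as `%40` rather than being mistaken for the host separator.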

Step 7: Check for available datasets, schemas and tables, etc

After successfully connecting the database with Superset, run the following queries to make sure the bigquery catalog gets picked up, then run show schemas and show tables to explore the available data.

show catalogs;

show schemas from bigquery;

show tables from bigquery.rohan88;

Step 8: Run SQL query from Apache Superset to access BigQuery table

Once you access your database schema, you can run SQL queries against the tables as shown below. 

select * from catalog.schema.table;

select * from bigquery.rohan88.flights LIMIT 1;

You can run similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI:

$ ./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password

The following example shows how you can join a Google BigQuery table with a Hive table from S3 and run SQL queries.
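As an illustration, such a federated join might look like the query below. The Hive schema, table, and column names here are assumptions made up for this example; only bigquery.rohan88.flights appears earlier in this tutorial.

```sql
-- Illustrative only: hive.demo.airports and the column names are assumed.
-- Join flight data in BigQuery with an airports table stored in S3 via Hive.
SELECT f.origin, a.airport_name, count(*) AS departures
FROM bigquery.rohan88.flights f
JOIN hive.demo.airports a ON f.origin = a.iata_code
GROUP BY f.origin, a.airport_name
ORDER BY departures DESC
LIMIT 10;
```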

At Ahana, we have made it very simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Snowflake may not be the silver bullet you wanted for your long term data strategy… here’s why

Since COVID, every business has pivoted and moved everything online, accelerating digital transformation with data and AI. Self-service, accelerated analytics has become more and more critical for businesses, and Snowflake did a great job bringing cloud data warehouses to the market when users were struggling with on-prem big data solutions and trying to catch up on their cloud journeys. Snowflake is designed foundationally to take advantage of the cloud’s benefits, and while it has benefited from a first-mover advantage, here are the key areas you should think about as you evaluate a cloud data warehouse like Snowflake.

Open Source and Vendor Lock-in

Using a SQL engine that is open source is strategically important because it allows the data to be queried without the need to ingest it into a proprietary system. Snowflake is not open source software. Only data that has been aggregated and moved into Snowflake in a proprietary format is available to its users. Moreover, Snowflake is pushing back on open source in favor of its proprietary solutions. Recently, Snowflake announced Snowflake Data Cloud, positioning Snowflake as a platform for “Cloud Data” where organizations can move and store all their data.

However, surrendering all your data to the Snowflake data cloud model creates vendor lock-in challenges: 

  1. Excessive cost as you grow your data warehouse
  2. If ingested into another system, data is typically locked into formats of the closed source system
  3. No community innovations or way to leverage other innovative technologies and services to process that same data

Snowflake doesn’t benefit from community innovation that true open source projects benefit from. For example, an open source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Ahana and more. At Twitter, engineers are working on the Presto-Iceberg connector, aiming to bring high-performance data analytics on open table format to the Presto ecosystem. 

Check this short session on an overview of how we are evolving Presto to be the next generation query engine at Facebook and beyond. 

With a proprietary technology like Snowflake, you miss out on community-led contributions that can shape a technology for the best of everyone. 

Open Format

Snowflake has chosen to use a micro-partition file format that might be good for performance but is closed source. The Snowflake engine cannot work directly with most common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into an internal Snowflake file format, but you miss out on the performance optimizations that these open formats bring to your engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge files.

On the other hand, Presto users can use TensorFlow on the same open formats, like Parquet and ORC, so there’s a lot of flexibility that you get with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.

While migrating from legacy data warehouse platforms to Snowflake may offer less friction for cloud adoption, trying to integrate open source formats into a single proprietary solution may not be as simple as sold.

Check this session on how you can leverage Apache Parquet, Apache Hudi and PrestoDB integration to build Open Data Lake.

Federated queries

A SQL engine is needed both for the data lake where raw data resides and for the broad range of other data sources, so that an organization can mix and match data from any source. If your data resides in relational databases, NoSQL databases, cloud storage, file systems like HDFS, etc., then Snowflake is not suitable for your self-service data lake strategy. You cannot run SQL queries across data stored in relational, non-relational, object, and custom data sources using Snowflake.

Workload Flexibility

Today users want to create new applications at the same rate as their data is growing and a single database is not a solution to support a broad range of analytical use cases. One common workload is training and using machine learning models right over warehouse tables or streaming analytics. Snowflake focuses on a traditional data warehouse as a managed service on the cloud and requires proprietary connectors to address these ML/DS workloads, which brings up data lineage challenges.

If you have a lot of unstructured data like text or images, the volume is beyond petabytes, or schema-on-read is a must-have feature, then Snowflake does not fit into your data lake architecture. 

The new generation of open platforms that unifies the data warehouse and advanced analytics is something that Snowflake is not fundamentally designed for; Snowflake is only suitable for data warehouse use cases.

Data Ownership

Snowflake did decouple storage and compute. However, Snowflake does not decouple data ownership. It still owns the compute layer as well as the storage layer. This means users must ingest data into Snowflake using a proprietary format, creating yet another copy of data and also requiring users to move their data out of their own environment. Users lose ownership of their data.

Cost

Users think of Snowflake as an easy and low-cost model. However, it can get very expensive and cost-prohibitive to ingest data into Snowflake. Very large data volumes and enterprise-grade, long-running queries can result in significant Snowflake costs.

As Snowflake is not fully decoupled, data is copied and stored into Snowflake’s managed cloud storage layer within Snowflake’s account. Hence, users end up paying Snowflake a higher cost than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher, proprietary price.

Conclusion

Snowflake may sound appealing in how simple it is to implement a data warehouse in the cloud. However, an open data lake analytics strategy will augment the data warehouse in the places where it falls short, as discussed above, providing significant long-term strategic benefits to users.

With PrestoDB as a SQL engine for Open Data Lake Analytics, you can execute SQL queries at high performance, similar to the EDW. Using Presto for your data lake analytics means you don’t have to worry about vendor lock-in, and it gets you the benefits of open source goodness like RaptorX, Project Aria, Apache Ranger integration, etc. Check this short tutorial on how to query a data lake with Presto.

While powerful, Presto can be complex and resource-intensive to manage and deploy. That’s where Ahana comes in. Ahana Cloud is the easiest managed service for PrestoDB in the cloud. We simplify open source operational challenges and support top innovations in the Presto community.

As Open Data Lake Analytics evolves, we have a great and advanced roadmap ahead. You can get started with Ahana Cloud today.

Hands-on Presto Tutorial: How to run Presto on Kubernetes


What is Presto?

Presto is a distributed query engine designed from the ground up for data lake analytics and interactive query workloads.

Presto supports connectivity to a wide variety of data sources – relational, analytical, NoSQL, and object stores – including search and indexing systems such as Elasticsearch and Druid.

The connector architecture abstracts away the underlying complexities of the data sources whether it’s SQL, NoSQL or simply an object store – all the end user needs to care about is querying the data using ANSI SQL; the connector takes care of the rest.

How is Presto typically deployed?

Presto deployments can be found in various flavors today. These include:

  1. Presto on Hadoop: This involves Presto running as a part of a Hadoop cluster, either as a part of open source or commercial Hadoop deployments (e.g. Cloudera) or as a part of Managed Hadoop (e.g. EMR, DataProc) 
  2. DIY Presto Deployments: Standalone Presto deployed on VMs or bare-metal instances
  3. Serverless Presto (Athena): AWS’ Serverless Presto Service
  4. Presto on Kubernetes: Presto deployed, managed and orchestrated via Kubernetes (K8s)

Each deployment has its pros and cons. This blog will focus on getting Presto working on Kubernetes.

All the scripts, configuration files, etc. can be found in these public github repositories:

https://github.com/asifkazi/presto-on-docker

https://github.com/asifkazi/presto-on-kubernetes

You will need to clone the repositories locally to use the configuration files.

git clone https://github.com/asifkazi/presto-on-docker.git
git clone https://github.com/asifkazi/presto-on-kubernetes.git

What is Kubernetes (K8s)?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes groups containers that make up an application into logical units for easy management and discovery. 

In most cases deployments are managed declaratively, so you don’t have to worry about how and where the deployment is running. You simply declaratively specify your resource and availability needs and Kubernetes takes care of the rest.

Why Presto on Kubernetes?

Deploying Presto on K8s brings together the architectural and operational advantages of both technologies. Kubernetes’ ability to ease operational management of the application significantly simplifies the Presto deployment – resiliency, configuration management, ease of scaling in-and-out come out of the box with K8s.

A Presto deployment built on K8s leverages the underlying power of the Kubernetes platform and provides an easy to deploy, easy to manage, easy to scale, and easy to use Presto cluster.

Getting Started – What do I need?

Local Docker Setup

To get your bearings and see what is happening with the Docker containers running on Kubernetes, we will first start with a single node deployment running locally on your machine. This will get you familiarized with the basic configuration parameters of the Docker container and make it way easier to troubleshoot.

Feel free to skip the local docker verification step if you are comfortable with docker, containers and Kubernetes.

Kubernetes / EKS Cluster

To run through the Kubernetes part of this tutorial, you need a working Kubernetes cluster. In this tutorial we will use AWS EKS (Elastic Kubernetes Service). Similar steps can be followed on any other Kubernetes deployment (e.g. Docker’s Kubernetes setup) with slight changes e.g. reducing the resource requirements on the containers.

If you do not have an EKS cluster and would like to quickly get an EKS cluster setup, I would recommend following the instructions outlined here. Use the “Managed nodes – Linux” instructions.

You also need to have a local cloned copy of the github repository https://github.com/asifkazi/presto-on-kubernetes

Nodegroups with adequate capacity

Before you go about kicking off your Presto cluster, you want to make sure you have node groups created on EKS with sufficient capacity.

After you have your EKS cluster created (in my case it’s ‘presto-cluster’), you should go in and add a node group which has sufficient capacity for the Presto Docker containers to run on. I plan on using R5.2xlarge nodes. I set up a node group of 4 nodes (you can tweak your Presto Docker container settings accordingly and use smaller nodes if required).

Figure 1: Creating a new nodegroup

Figure 2: Setting the instance type and node count

Once your node group shows as active, you are ready to move on to the next step.

Figure 3: Make sure your node group is successfully created and is active

Tinkering with the Docker containers locally

Let’s first make sure the Docker container we are going to use with Kubernetes is working as desired. If you would like to review the Dockerfile, the scripts, and the environment variables supported, the repository can be found here.

The details of the specific configuration parameters used to customize the container behavior can be found in the entrypoint.sh script. You can override any of the default values by providing values via the --env option for docker, or by using name-value pairs in the Kubernetes YAML file as we will see later.

You need the following:

  1. A user and their Access Key and Secret Access Key for Glue and S3 (you can use the same or a different user):

 arn:aws:iam::<your account id>:user/<your user>

  2. A role which the user above can assume to access Glue and S3:

arn:aws:iam::<your account id>:role/<your role>

Figure 4: Assume role privileges

Figure 5: Trust relationships


  3. Access to the latest Docker image for this tutorial: asifkazi/presto-on-docker:latest

Warning: The permissions provided above are pretty lax, giving the user a lot of privileges, not just on assume role but also on what operations the user can perform on S3 and Glue. DO NOT use these permissions as-is for production use. It’s highly recommended to tie down the privileges using the principle of least privilege (only provide the minimal access required).

Run the following commands:

  1. Create a network for the nodes:

docker network create presto

  2. Start a MySQL Docker instance:

docker run --name mysql -e MYSQL_ROOT_PASSWORD='P@ssw0rd$$' -e MYSQL_DATABASE=demodb -e MYSQL_USER=dbuser -e MYSQL_PASSWORD=dbuser -p 3306:3306 -p 33060:33060 -d --network=presto mysql:5.7

  3. Start the Presto single-node cluster on Docker:

docker run -d --name presto \

 --env PRESTO_CATALOG_HIVE_S3_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>"  \

--env PRESTO_CATALOG_HIVE_S3_AWS_ACCESS_KEY="<Your Access Key>" \

--env PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY="<Your Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_ACCESS_KEY="<Your Glue Access Key>" \

--env PRESTO_CATALOG_HIVE_GLUE_AWS_SECRET_KEY="<Your Glue Secret Access Key>" \

--env PRESTO_CATALOG_HIVE_METASTORE_GLUE_IAM_ROLE="arn:aws:iam::<Your Account>:role/<Your Role>" \

-p 8080:8080 \

--network=presto \

asifkazi/presto-on-docker:latest

  4. Make sure the containers came up correctly:

docker ps 

  5. Interactively log into the Docker container:

docker exec -it presto bash

  6. From within the Docker container, verify that everything is working correctly by running the following command:

presto

  7. From within the Presto CLI, run the following:

show schemas from mysql;

The command should show the MySQL databases.

  8. From within the Presto CLI, run the following:

show schemas from hive;

The command should show the databases from Glue. If you are using Glue for the first time, you might only see the information_schema and default databases.

We have validated that the Docker container itself is working fine as a single-node cluster (worker and coordinator on the same node). We will now move on to getting this environment working in Kubernetes. But first, let’s clean up.

Run the following command to stop and cleanup your docker instances locally.

docker stop mysql presto;docker rm mysql presto;

Getting Presto running on K8s

To get Presto running on K8s, we will configure the deployment declaratively using YAML files. In addition to Kubernetes-specific properties, we will provide all the Docker env properties via name-value pairs.

  1. Create a namespace for the presto cluster:

kubectl create namespace presto

  2. Override the env settings in the presto.yaml file for both the coordinator and worker sections
  3. Apply the YAML file to the Kubernetes cluster:

kubectl apply -f presto.yaml --namespace presto

  4. Let’s also start a MySQL instance. We will first create a persistent volume and claim:

kubectl apply -f ./mysql-pv.yaml --namespace presto

  1. Create the actual instance

kubectl apply -f ./mysql-deployment.yaml --namespace presto

  1. Check the status of the cluster make sure there are no errored or failing pods

kubectl get pods -n presto

  1. Log into the container and repeat the verification steps for mysql and Hive that we executed for docker. You are going to need the pod name for the coordinator from the command above.

kubectl exec -it  <pod name> -n presto  -- bash

kubectl exec -it presto-coordinator-5294d -n presto  -- bash

Note: the space between the —  and bash is required

  1. Querying seems to be working but is the Kubernetes deployment a multi-node cluster? Let’s check:

select node,vmname,vmversion from jmx.current."java.lang:type=runtime";

  1. Let’s see what happens if we destroy one of the pods (simulate failure)

kubectl delete pod presto-worker-k9xw8 -n presto

  1. What does the current deployment look like?

What? The pod was replaced by a new one presto-worker-tnbsb!

  1. Now we’ll modify the number of replicas for the workers in the presto.yaml
  1. Set replicas to 4

Apply the changes to the cluster

kubectl apply -f presto.yaml –namespace presto

Check the number of running pods for the workers

kubectl get pods -n presto

Wow, we have a fully functional presto cluster running! Imagine setting this up manually and tweaking all the configurations yourself, in addition to managing the availability and resiliency. 
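The presto.yaml file edited and applied throughout this section is not reproduced in the tutorial. As a rough orientation, here is a minimal sketch of its shape; the deployment names, labels, and env entries are assumptions, with only the image and replica count taken from the steps above:

```yaml
# Sketch of presto.yaml (names and env values are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: presto-coordinator
spec:
  replicas: 1
  selector:
    matchLabels: { app: presto-coordinator }
  template:
    metadata:
      labels: { app: presto-coordinator }
    spec:
      containers:
        - name: presto
          image: asifkazi/presto-on-docker:latest
          ports:
            - containerPort: 8080
          env:                      # the docker --env flags become name/value pairs
            - name: PRESTO_CATALOG_HIVE_S3_AWS_SECRET_KEY
              value: "<Your Secret Access Key>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: presto-worker
spec:
  replicas: 4                       # the value changed in the scaling step above
  selector:
    matchLabels: { app: presto-worker }
  template:
    metadata:
      labels: { app: presto-worker }
    spec:
      containers:
        - name: presto
          image: asifkazi/presto-on-docker:latest
```

The same scale-out could also be done imperatively with kubectl scale, but editing the YAML and re-applying it keeps the file as the single source of truth for the deployment.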

Summary

In this tutorial we set up a single-node Presto cluster on Docker and then deployed the same image to Kubernetes. By taking advantage of Kubernetes configuration files and constructs, we were able to scale out the Presto cluster to our needs, as well as demonstrate resiliency by forcefully killing off a pod.

Kubernetes and Presto, better together. You can run large scale deployments of one or more Presto clusters with ease.

Ready for your next Presto lesson from Ahana? Check out our guide to running Presto with AWS Glue as catalog on your laptop.

Presto 102 Tutorial: Install PrestoDB on a Laptop or PC

Summary

PrestoDB is an open source distributed SQL query engine. In Tutorial 101 we walked through manual installation and configuration on a bare-metal server or a VM. It is very common to try PrestoDB on a laptop for quick validation, and this guide, Tutorial 102, walks through simple steps to install a three-node PrestoDB cluster on a laptop.

Environment

This guide was developed using a laptop running Windows with Docker installed.

Steps for Implementing Presto

Step 1: 

Create a docker network namespace, so that containers can communicate with each other using it.

C:\Users\rupendran>docker network create presto_network
d0d03171c01b5b0508a37d968ba25638e6b44ed4db36c1eff25ce31dc435415b

Step 2

Ahana has developed a sandbox for prestodb, which can be downloaded from Docker Hub. Use the command below to download the prestodb sandbox, which comes with all the packages needed to run prestodb.

C:\Users\prestodb>docker pull ahanaio/prestodb-sandbox
Using default tag: latest
latest: Pulling from ahanaio/prestodb-sandbox
da5a05f6fddb: Pull complete
e8f8aa933633: Pull complete
b7cf38297b9f: Pull complete
a4205d42b3be: Pull complete
81b659bbad2f: Pull complete
3ef606708339: Pull complete
979857535547: Pull complete
Digest: sha256:d7f4f0a34217d52aefad622e97dbcc16ee60ecca7b78f840d87c141ba7137254
Status: Downloaded newer image for ahanaio/prestodb-sandbox:latest
docker.io/ahanaio/prestodb-sandbox:latest

Step 3:

Start an instance of the prestodb sandbox and name it coordinator.

C:\Users\prestodb>docker run -d -p 8080:8080 -it --net presto_network --name coordinator ahanaio/prestodb-sandbox
db74c6f7c4dda975f65226557ba485b1e75396d527a7b6da9db15f0897e6d47f

Step 4:

Check the cluster UI for status. By default, the Ahana prestodb sandbox comes with one worker and a coordinator.

If only the coordinator needs to run, without the worker node, edit the config.properties file and set node-scheduler.include-coordinator to false.

sh-4.2# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
sh-4.2#

Now the prestodb UI will show an Active worker count of 0.

Step 5: 

Start another instance of the prestodb sandbox, which will run as a worker node:

C:\Users\rupendran>docker run -d -p 8081:8081 -it --net presto_network --name workerN1 ahanaio/prestodb-sandbox
80dbb7e1d170434e06c10f9316983291c10006d53d9c6fc8dd20db60ddb4a58c

Step 6: 

Since the sandbox comes with a coordinator, it needs to be disabled for the second instance so that it runs as a worker node. To do that, open the terminal window in the Docker container/apps UI and edit the etc/config.properties file: set coordinator to false and set the HTTP port to be different from the coordinator's.

sh-4.2# cat etc/config.properties
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080
sh-4.2#
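The edit in Step 6 can also be scripted rather than done by hand in the container terminal. A minimal sketch, assuming the same three properties as above (it writes to a temporary file for illustration; inside the sandbox the target would be etc/config.properties):

```shell
# Sketch: generate the worker config shown above into a temp file.
# Inside the sandbox container the destination is etc/config.properties.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
coordinator=false
http-server.http.port=8081
discovery.uri=http://coordinator:8080
EOF
grep -c '=' "$cfg"   # prints 3
```

Each additional worker gets the same file with a distinct http-server.http.port.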

Step 7:

Restart the workerN1 container and check the prestodb UI. The active worker count will now be 1 if the coordinator runs without a worker node, or 2 if the coordinator also runs a worker node.

Step 8:

Repeat steps 5 to 7 to add a third worker node. 

  • Start a new instance of the ahanaio/prestodb-sandbox image
  • Disable the coordinator, set the port to be different from the coordinator's, and set the discovery URI to the container name of the coordinator

C:\Users\rupendran>docker run -d -p 8082:8082 -it --net presto_network --name workerN2 ahanaio/prestodb-sandbox
16eb71da54d4a9c30947970ff6da58c65bdfea9cb6ad0c76424d527720378bdd

Step 9: 

Check cluster status, should reflect the third worker node as part of the prestodb cluster.

Step 10:

Verify the prestodb environment by running the prestodb CLI with a simple TPC-H query:

sh-4.2# presto-cli
presto> SHOW SCHEMAS FROM tpch;

       Schema
--------------------
 information_schema
 sf1
 sf100
 sf1000
 sf10000
 sf100000
 sf300
 sf3000
 sf30000
 tiny
(10 rows)

Query 20210709_195712_00006_sip3d, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0.01 [10 rows, 119B] [12 row/s, 153B/s]

presto>

Summary:

Prestodb cluster installation is simplified with Ahana prestodb sandbox. It’s now ready to be used for any functional validation.

Enabling spill to disk for optimal price per performance

Presto was born out of the need for low-latency interactive queries on large scale data, and hence, continually optimized for that use case. In such scenarios, the best practice is to properly size Presto Worker nodes such that all the aggregate cluster memory can fit all the data required for target data sources, queries, and concurrency level. In addition, to ensure fairness of memory across queries and prevent deadlocks, by default, Presto will kill queries that exceed configured memory limits.

The Case for Spill to Disk

As Presto usage and adoption continue to grow, it is being used for more and more different use cases. For some of these use cases, full memory bandwidth and low latency are not necessary. For example, consider long-running queries on large historical data, such as logs, where low-latency results are not paramount. In these cases, it may be acceptable, and even more optimal overall, to trade some performance for cost savings. One way to achieve this is, of course, to use lower-memory Presto Workers. However, perhaps these longer batch workloads where higher latency is tolerable are not the norm, but the minority case. Enter Presto's spill to disk functionality, where Presto can be configured to spill intermediate data from memory to disk when needed. While queries that spill to disk have longer execution times than an entirely in-memory equivalent, the query will not fail due to exceeding configured memory properties.

Cost Savings of Spill to Disk

Let’s walk through a practical example of a spill to disk scenario. A 15-Worker Presto cluster of r5.2xlarge instances (64 GB memory, 8 vCPU, $0.50 per hour) in AWS costs about $180 per day, with aggregate cluster memory of close to 1 TB (960 GB actual). If instead we ran a cluster with about 30% fewer Presto Workers, at 10 nodes, we would decrease the aggregate cluster memory by 320 GB. Now let’s augment the remaining 10 nodes with 1 TB of disk storage in total (more than 3x the 320 GB of memory removed), or 100 GB per node, to leverage Presto disk spilling. At $0.10 per GB-month for gp2 EBS volumes, the storage costs only $100 per month, a small fraction of the instance costs, even with the 3x factor.
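The arithmetic above can be checked in a couple of lines of shell, using the prices quoted in the text ($0.50 per hour per r5.2xlarge and $0.10 per GB-month for gp2):

```shell
# Cost sketch: 15-worker cluster per day vs. spill storage per month,
# using the prices quoted above.
cluster_per_day=$(awk 'BEGIN { printf "%d", 15 * 0.50 * 24 }')   # 15 workers, 24 hours
spill_per_month=$(awk 'BEGIN { printf "%d", 10 * 100 * 0.10 }')  # 10 nodes x 100 GB
echo "cluster: \$$cluster_per_day/day  spill storage: \$$spill_per_month/month"
# prints: cluster: $180/day  spill storage: $100/month
```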

Spill to Disk Configuration

There are several configuration properties that need to be set to use spill to disk and they are documented in the Presto documentation. Here is an example configuration with 50 GB of storage allocated to each Worker for spilling.

experimental.spiller-spill-path=/path/to/spill/directory
experimental.spiller-max-used-space-threshold=0.7
experimental.max-spill-per-node=50GB
experimental.query-max-spill-per-node=50GB
experimental.max-revocable-memory-per-node=50GB
  • experimental.spiller-spill-path: Directory where spilled content will be written.
  • experimental.spiller-max-used-space-threshold: If disk space usage ratio of a given spill path is above this threshold, the spill path will not be eligible for spilling.
  • experimental.max-spill-per-node: Max spill space to be used by all queries on a single node.
  • experimental.query-max-spill-per-node: Max spill space to be used by a single query on a single node.
  • experimental.max-revocable-memory-per-node: How much revocable memory any one query is allowed to use.

Conclusion

Several large scale Presto deployments take advantage of spill to disk, including Facebook. Today, the Ahana Cloud managed Presto service enables spill to disk by default and sets the per node spill to disk limit at 50 GB. We will be releasing the ability for customers to configure and tune their per node spill-to-disk size soon. Give it a try. You can sign up and start using our service today for free.

Presto 101 Tutorial: Installing & Configuring Presto

Installing & Configuring Presto locally

Presto Installation

Presto can be installed manually or using docker images on:

  • A single node: both the coordinator and workers run on the same machine.
  • Multiple machines: the coordinator and workers run on separate machines, depending on the workload requirements.

Installing Presto Manually

Download the Presto server tarball, presto-server-0.235.1.tar.gz, and unpack it. The tarball contains a single top-level directory, presto-server-0.235.1, which we will call the installation directory.

Run the commands below to install the official tarballs for presto-server and presto-cli from prestodb.io

[root@prestodb_c01 ~]# curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.235.1/presto-server-0.235.1.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  721M  100  721M    0     0  72.9M      0  0:00:09  0:00:09 --:--:--  111M
[root@prestodb_c01 ~]# curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.235.1/presto-cli-0.235.1-executable.jar
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.7M  100 12.7M    0     0  21.9M      0 --:--:-- --:--:-- --:--:-- 21.9M

Data Directory

Presto needs a data directory for storing logs, etc. We recommend creating a data directory outside of the installation directory, which allows it to be easily preserved when upgrading Presto.

[root@prestodb_c01 ~]# mkdir -p /var/presto/data

Configuration Settings

Create an etc directory inside the installation directory. This will hold the following configuration:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command-line options for the Java Virtual Machine
  • Config Properties: configuration for the Presto server
  • Catalog Properties: configuration for Connectors (data sources)
[root@prestodb_c01 ~]# mkdir etc

Node Properties

The node properties file, etc/node.properties contains configuration specific to each node. A node is a single installed instance of Presto on a machine. This file is typically created by the deployment system when Presto is first installed. The following is a minimal etc/node.properties:

[root@prestodb_c01 ~]# cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

The above properties are described below:

  • node.environment: The name of the environment. All Presto nodes in a cluster must have the same environment name.
  • node.id: The unique identifier for this installation of Presto. This must be unique for every node. This identifier should remain consistent across reboots or upgrades of Presto. If running multiple installations of Presto on a single machine (i.e. multiple nodes on the same machine), each installation must have a unique identifier.
  • node.data-dir: The location (filesystem path) of the data directory. Presto will store logs and other data here.
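A minimal way to generate this file per node is sketched below; the use of uuidgen (with a Linux /proc fallback) to produce the unique node.id is an assumption about the host, and any stable unique string works:

```shell
# Sketch: write etc/node.properties with a unique, stable node.id.
# Writes under a temp dir here for illustration; in a real install,
# write into the etc directory of the installation.
dir=$(mktemp -d)
node_id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
cat > "$dir/node.properties" <<EOF
node.environment=production
node.id=$node_id
node.data-dir=/var/presto/data
EOF
cat "$dir/node.properties"
```

Persist the generated file across reboots so the node.id stays consistent, as required above.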

JVM configuration

The JVM config file, etc/jvm.config, contains a list of command-line options used for launching the Java Virtual Machine. The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.

The following provides a good starting point for creating etc/jvm.config:

[root@prestodb_c01 ~]# cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Because an OutOfMemoryError will typically leave the JVM in an inconsistent state, we write a heap dump (for debugging) and forcibly terminate the process when this occurs.

Config Properties

The config properties file, etc/config.properties, contains the configuration for the Presto server. Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters.

To set up a single machine for testing that functions as both a coordinator and a worker, set the following parameters in etc/config.properties:

[root@singlenode01 ~]# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080
  • coordinator: Allow this Presto instance to function as a coordinator (accept queries from clients and manage query execution).
  • node-scheduler.include-coordinator: Allow scheduling work on the coordinator. 
  • http-server.http.port: Specifies the port for the HTTP server. Presto uses HTTP for all communication, internal and external.
  • query.max-memory: The maximum amount of distributed memory that a query may use.
  • query.max-memory-per-node: The maximum amount of user memory that a query may use on any one machine.
  • query.max-total-memory-per-node: The maximum amount of user and system memory that a query may use on any one machine, where system memory is the memory used during execution by readers, writers, and network buffers, etc.
  • discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.
  • discovery.uri: The URI of the Discovery server. Because we have enabled the embedded version of Discovery in the Presto coordinator, this should be the URI of the Presto coordinator. Replace example.net:8080 with the host and port of the Presto coordinator. This URI must not end in a slash.
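For the larger-cluster arrangement recommended above, with a dedicated coordinator and separate worker machines, the two config.properties files differ in only a few lines. A sketch, with a placeholder coordinator host name:

```properties
# Dedicated coordinator's etc/config.properties (sketch; host is a placeholder)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://coordinator.example.net:8080

# Each worker's etc/config.properties (sketch)
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://coordinator.example.net:8080
```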

You may also wish to set the following properties:

  • jmx.rmiregistry.port: Specifies the port for the JMX RMI registry. JMX clients should connect to this port.
  • jmx.rmiserver.port: Specifies the port for the JMX RMI server. Presto exports many metrics that are useful for monitoring via JMX.

Log Levels

The optional log levels file, etc/log.properties allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the class that uses the logger. 

[root@coordinator01 ~]# cat  etc/log.properties
com.facebook.presto=INFO

There are four levels: DEBUG, INFO, WARN and ERROR.

Catalog Properties

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. 

Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, create etc/catalog/jmx.properties with the following contents to mount the jmx connector as the jmx catalog:

[root@coordinator01 ~]# mkdir etc/catalog
[root@coordinator01 ~]# echo "connector.name=jmx" >> etc/catalog/jmx.properties
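Other connectors are registered the same way, one properties file per catalog. The sketch below stages a hypothetical MySQL catalog; the connection URL, user, and password are placeholders, not values from this guide, and a temp dir stands in for the installation directory:

```shell
# Sketch: register a MySQL catalog alongside jmx.
# A temp dir stands in for the Presto installation directory;
# connection values are placeholders.
install_dir=$(mktemp -d)
mkdir -p "$install_dir/etc/catalog"
cat > "$install_dir/etc/catalog/mysql.properties" <<'EOF'
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=<user>
connection-password=<password>
EOF
ls "$install_dir/etc/catalog"   # prints mysql.properties
```

After a restart, the catalog is queryable as mysql.<schema>.<table>.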

Running Presto

The installation directory contains the launcher script in bin/launcher. Presto can be started as a daemon by running the following:

[root@hsrhvm01 presto-server-0.235.1]# bin/launcher start
Started as 23378

After launching, you can find the log files in var/log:

  • launcher.log: This log is created by the launcher and is connected to the stdout and stderr streams of the server. It will contain a few log messages that occur while the server logging is being initialized and any errors or diagnostics produced by the JVM.
  • server.log: This is the main log file used by Presto. It will typically contain the relevant information if the server fails during initialization. It is automatically rotated and compressed.
  • http-request.log: This is the HTTP request log which contains every HTTP request received by the server. It is automatically rotated and compressed.

Why I’m betting on PrestoDB, and why you should too!

Presto Open Data Lake

By Dipti Borkar, Ahana Cofounder, Chief Product Officer & Chief Evangelist

I’ve been in open source software companies and communities for over 10 years now, and in the database industry my whole career. So I’ve seen my fair share of the good, the bad, and the ugly when it comes to open source projects and the communities that surround them. And I’d like to think that all this experience has led me to where I am today – cofounder of Ahana, a company that’s betting big on an open source project: PrestoDB. Let’s first talk about the problem we’re trying to solve.

The Big Shift 

Organizations have been using costly, proprietary data warehouses as the workhorse for analytics over many years. And in the last few years, we’ve seen a shift to cloud data warehouses like Snowflake, AWS Redshift, and BigQuery.

Couple that with the fact that organizations have much more data (from terabytes to tens and hundreds of terabytes, even petabytes) and more diverse types of data (telemetry, behavioral, IoT, and event data in addition to enterprise data), and there’s even greater urgency for users not to get locked into proprietary formats and proprietary systems.

These shifts along with AWS commoditizing the storage layer that is ubiquitous and affordable means that a lot more data is now in cloud object stores like AWS S3, Google Cloud Storage and Azure Blob Store. 

So how do you query the data stored in these data lakes?  How do you ask questions of the data and pull the answers into reports and dashboards? 

The “SQL on S3” Problem 

  • Data lakes are only storage – there is no intelligence in the data lake 
  • If the data is structured, think in tables and columns, SQL is hands down the best way to query it – hey it’s survived 50+ years 
  • If data is semi-structured, think nested like JSON etc. SQL still can be used to query with extensions to the language 
  • But SQL on data lakes was complicated. Hadoop tried it, we know that didn’t work. While it sort of solved this problem, trying to get 70+ different components and projects to integrate and work turned out to be a nightmare for data platform teams. 
  • There was really no simple yet performant way of querying S3

That is where Presto comes in. 

Presto is the best engine built to directly query open formats and data lakes. Presto replaced Hive at Facebook. Presto is the heart of the modern Open Data Lake Analytics Stack – an analytics stack that includes open source, open formats, open interfaces, and open cloud. (You can read more details about this in my Dataversity article.) 

The fact of the matter is that this problem can be solved with many different open source projects/engines. At Ahana, we chose Presto – the engine that runs at Facebook, Uber and Twitter. 

Why PrestoDB?

1. Crazy good tech 

Presto is in-memory, scalable, and built like a database. And with the new innovations being added, Presto is only becoming bigger and better.

Like I mentioned earlier, I’ve been in open source and the database space for a long time and built multiple database engines – both structured and semi-structured (aka NoSQL). I believe that PrestoDB is the query engine most aligned with the direction the analytics market is headed. At Ahana, we’re betting on PrestoDB and have built a managed service around it that solves the problem for Open Data Lake Analytics – the SQL on S3 problem. 

No other open source project has come close. Some that come to mind –

  • Apache Drill unfortunately lost its community. It had a great start being based on the Dremel paper (published by Google) but over time didn’t get the support it should have from the vendor behind it and the community fizzled. 
  • SparkSQL (built on Apache Spark) isn’t built like a database engine but instead is bolted on top of a general purpose computation engine
  • Trino, a hard fork of Presto, is largely focused on a different problem of broad access across many different data sources versus the data lake. I fundamentally believe that all data sources are NOT equal and that data lakes will be the most important data source over time, overtaking the data warehouses. Data sources are not equal for a variety of reasons: 
    • Amount of data stored 
    • Type of information stored 
    • Type of analysis that can be supported on it 
    • Longevity of data stored
    • Cost of managing and processing the data 

Given that 80-90% of the data lives on S3, I’ve seen that 80-90% of analytics will be run on this cheapest data source. That data source is the data lake. And the need to perform a correlation across more data sources comes up only 5-10% of the time for a window of time until the data from those data sources also gets ingested into the data lake. 

As an example: MySQL is the workhorse for operational transactional systems. A complex analytical query with multi-way joins from a federated engine could bring down the operational system. And while there may be a small window of time when the data in MySQL is not available in the data lake, it will eventually be moved over to the lake, and that’s where the bulk of the analysis will happen. 

2. Vendor-neutral Open Source, not Single-vendor Open Source

On top of being a great SQL query engine, Presto is open source and, most importantly, part of the Linux Foundation – governed with transparency and neutrality. On the other hand, Trino, the hard fork of Presto, is a single-vendor project, with most of the core contributors being employees of the vendor. This is problematic for any company planning on using a project as a core component of its data infrastructure, more so for a vendor like Ahana that needs to be able to support its customers and contribute and enhance the source code.

Presto is hosted under Linux Foundation in the Presto Foundation, similar to how Kubernetes is hosted by Cloud Native Computing Foundation (CNCF) under the Linux Foundation umbrella. Per the bylaws of the Linux Foundation, Presto will always stay open source and vendor neutral. It was a very important consideration for us that we could count on the project remaining free and open forever, given so many examples where we have seen single-vendor projects changing their licenses to be more restrictive over time to meet the vendor’s commercialization needs. 

For those of you that follow open source, you most likely saw the recent story on Elastic changing its open-source license which created quite a ripple in the community. Without going into all the details, it’s clear that this move prompted a backlash from a good part of the Elastic community and its contributors. This wasn’t the first “open source” company to do this (see MongoDB, Redis, etc.), nor will it be the last. I believe that over time, users will always pick a project driven by a true gold-standard foundation like Linux Foundation or Apache Software Foundation (ASF) over a company-controlled project. Kubernetes over time won the hearts of engineers over alternatives like Apache Mesos and Docker Swarm and now has one of the biggest, most vibrant user communities in the world. I see this happening with Presto.

3. Presto runs in production @ Facebook, runs in production @ Uber & runs in production @ Twitter. 

The most innovative companies in the world run Presto in production for interactive SQL workloads. Not Apache Drill, not SparkSQL, not Apache Hive and not Trino. The data warehouse engine at Facebook is Presto. Ditto that for Uber, likewise for Twitter. 

When a technology is used at the scale these giants run at, you know you are not only getting technology created by the brightest minds but also tested at internet scale.

The Conclusion 

And that’s why I picked Presto, that’s PrestoDB and there’s only 1 Presto. 

In summary, I believe that Presto is the de facto standard for SQL analytics on data lakes. It is the heart of the modern open analytics stack and will be the foundation of the next 10 years of open data lake analytics. 

Join us on our mission to make PrestoDB the open source, de facto standard query engine for everyone.

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

In this brief post, we’ll discuss the 5 main reasons why data platform engineers decide to move their data analytics workloads from Amazon Athena to Ahana Cloud for Presto.

While AWS Athena’s serverless architecture means users don’t need to scale, provision, or manage any servers, there are trade-offs with a serverless approach around performance, pricing, and several technical limitations.

What are AWS Athena and Ahana Cloud for Presto?

Presto is an open source distributed SQL query engine designed for petabyte-scale interactive analytics against a wide range of data sources, from your data lake to traditional relational databases.

Ahana Cloud for Presto provides a fully managed Presto cloud service in AWS, with support for a wide range of native Presto connectors, IO caching, and configurations optimized for your workload.

AWS Athena is a serverless interactive query service built on Presto that developers use to query AWS S3-based data lakes and other data sources.

While there are some benefits to AWS Athena, let’s talk about why the data engineers we talk to migrate to Ahana Cloud.

1. Need for Concurrency & Partitions

AWS Athena's maximum concurrency is limited to 20-25 queries depending on the region, and users must request increased quotas. Some users even observe a max concurrency nearer to 3. Athena users can only run up to 5 queries simultaneously for each account, and Athena restricts each account to 100 databases. Athena's partition limit is 20K partitions per table when using the Hive catalog. These limitations pose challenges if you have a complex query running ahead of more latency-sensitive workloads, like serving results to a user-facing dashboard.

Ahana Cloud, on the other hand, runs any number of queries whenever you need them. You have full transparency into what's going on under the hood. You get unlimited concurrency because you can simply scale the number of distributed workers.

2. Need for Performance predictability

When using AWS Athena you don’t control the number of underlying servers that AWS allocates to Athena to run your queries. As the Athena service is shared, the performance characteristics can change frequently and substantially. One minute there may be 50 servers, the next only 10 servers.

With Ahana Cloud, because you have full control of your deployment, performance is always consistent, and often faster than Athena's.

3. Need for more Data source connectors

AWS Athena doesn’t use native Presto connectors, so you’ll need to use the limited options AWS provides or build your own with the AWS Lambda service.

In Ahana Cloud, you can define and manage data sources in the SaaS console, and attach or detach them from any cluster with the click of a button. Connect your existing Amazon database services like RDS / MySQL, RDS / PostgreSQL, Elasticsearch, and Amazon Redshift.

4. Need for control over the underlying engine

AWS Athena’s serverless nature may make it easy to use, but it also means users have no control over adding more sessions, resources, debugging, etc.

In Ahana Cloud however, you control the number of Presto nodes in your deployment, and you choose the node instance-types for optimum price/performance. That’s easy with the full visibility provided via dashboards on performance and query management.

5. Need for Price predictability

AWS Athena billing is per query, based on volume of data scanned, making it inefficient and expensive at scale. Because costs are hard to control and predict, it leads to “bill shock” for some users. If one query scans one terabyte, that’s $5 for a few seconds.

Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, increased resilience, speed, and ease of use. Plus, Ahana has pay-as-you-go pricing – you only pay for what you use. Using the same example, $5 lets you run a 6-node cluster of r5.xlarge instances for an hour, or hundreds of queries instead of just one.
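To make the "bill shock" concrete, here is the scan-based arithmetic at Athena's $5-per-TB rate; the query count and per-query scan size below are illustrative assumptions, not figures from this article:

```shell
# Sketch: monthly Athena cost at $5 per TB scanned.
# 200 queries/day and 0.5 TB scanned per query are assumptions.
queries_per_day=200
tb_per_query=0.5
monthly=$(awk -v q="$queries_per_day" -v tb="$tb_per_query" \
  'BEGIN { printf "%d", q * tb * 5 * 30 }')
echo "estimated Athena bill: \$$monthly/month"
# prints: estimated Athena bill: $15000/month
```

Because the bill scales with bytes scanned rather than cluster size, a single poorly partitioned table can dominate the monthly cost.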

Summary

AWS Athena's serverless architecture makes it really easy to get started with; however, the service has many limitations that can cause problems, and many data engineering teams have spent hours trying to diagnose them. Due to these limitations, AWS Athena can run slowly and increase operational costs.

Ahana Cloud for Presto is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists. And all this without the limits of AWS Athena.

Ahana Cloud is available in AWS. You can sign up and start using our service today for free.

Presto Foundation compared to Single Vendor-Driven Open Source Projects

Why Project Neutrality Matters

Steven Mih, Cofounder and CEO, Ahana

PrestoCon Day was last week on March 24, and it was an incredible event with lots of great speakers, great attendance, and positivity around the Presto community. I gave a lightning talk called the “Top Ten Reasons to Use and Contribute to Presto.”

In Letterman style, it’s a read out of each reason, starting from 10 and counting down to 1. Number 9 was:

Presto is neutrally governed with the stated principle that no one individual or company can control the project.

At this point in my presentation, some attendees spammed the chat thread with links claiming that Trino has neutrality, implying that Presto does not.

This blog will share data points on why this claim on neutrality is false.

First off, what is Neutrality?

Some say the word “neutrality” with a tone of righteousness, implying freedom or independence of some sort. But that isn’t what the word means. As a noun, “neutrality” simply means impartiality: when there is a decision to be made, there isn’t support for one side vs. another.

Presto Foundation is neutrally-governed

Here are two data points that show Presto is neutrally-governed:

#1: Linux Foundation’s Presto Foundation has adopted these principles since Dec 2019: 

Slide from PrestoCon Day Opening Presentation

#2: The Presto Foundation Governing Board operates with these principles, with oversight from the Linux Foundation. They have grown and diversified Presto Foundation membership from 4 member companies to 10 member companies today. They have worked with the project’s Technical Steering Committee to grow representation from 2 companies to 5 companies today.

Presto is controlled by multiple end-user companies and vendor companies with a common set of principles and Linux Foundation oversight. 

So Presto is a neutrally-governed OSS project.

Single vendor-driven OSS projects don’t have neutrality

Single vendor OSS projects are ones where the majority of committers work for one vendor company – see companies like MongoDB, Elastic, and Starburst. In this case, OSS projects are tied to vendors and their commercial agendas – they are not impartial, so they are not neutral.

With that in mind, here are two data points that show Trino is a single vendor-driven project:

#1) The Trino Software Foundation owners are Martin Traverso, Dain Sundstrom and David Phillips:

“The foundation will be led by the three original creators of Presto: Martin Traverso, Dain Sundstrom and David Phillips.” (source: press release)

#2) The Starburst cofounders and CTOs are Martin Traverso, Dain Sundstrom and David Phillips.

Trino is in a foundation that is controlled by cofounders of one vendor company, so it is a single vendor-driven OSS project.

Therefore, Trino does not have neutrality. 

Why Neutrally-Governed Open Source Matters to You

For a Linux Foundation open source project, neutrality principles give users and developers confidence that they will be treated without a commercial agenda. In practice, some examples are:  

  • Breaking changes to the code base have to be proposed and agreed upon.
  • Your code contributions will be viewed purely on their technical merit. 
  • Your ability to become a committer is not limited by a vendor agenda.
  • More community innovation at the core of the project, not just peripheral features.
  • The project’s open source license will not be changed. 

I’ve been an employee in 3 different single vendor-driven OSS projects before and have seen the kinds of decisions that are made in those circumstances. The common pattern is what I call the “EE vs. CE” internal debate between sales and developers on how to balance the open sourcing of new features in the Community Edition (CE) while keeping the money-making features proprietary in the Enterprise Edition (EE). As the company grows there’s almost always a shift to the money-making features of the company. 

Join the Presto community! 

We believe in Presto and the community benefits of a neutral governance model. 

The combination of stated principles and oversight has helped Presto flourish, instead of leaning towards the agenda of any one individual, set of individuals, or company. The Presto Foundation works impartially together as a community.

Upsolver and Ahana Announce Technology Partnership to Accelerate Adoption of Open Data Lake Analytics in the Cloud with Presto

Todd Odess, Head of Partnerships at Upsolver | Ali LeClerc, Head of Marketing at Ahana

Today we’re thrilled to announce our joint technology partnership. Upsolver, the first no-code, high-speed cloud compute layer for making data lakes analytics-ready, and Ahana, the managed service for Presto, offer an end-to-end cloud-native solution for open data lake analytics. This solution gives anyone the ability to prepare data for analysis and query it. Upsolver’s visual UI lets users easily land and automatically optimize data in AWS S3, and then query it with Ahana Cloud for Presto.

Customers get to value quickly and affordably, without having to write and maintain Spark-based data pipelines or manage complex query engines. Over the long-term, they benefit from greater analytics agility, being able to productionize new use cases in days rather than weeks or months. 

Companies today collect and manage massive amounts of data – on AWS S3 alone there are 10,000+ data lakes. But studies show that companies only analyze 12% of their data; the rest is locked up in siloed systems, expensive on-premises solutions, and incompatible data formats. Open Data Lake Analytics helps users unlock that data. 

Open Data Lake Analytics is an approach to traditional data analytics that opens up a wider range of data for analytics in the cloud. It’s built on open source Presto, the distributed SQL query engine, and supports open formats (Apache Parquet, Apache ORC, and more) and open interfaces that integrate with existing SQL systems. It can also run on any cloud. 

Together, Ahana and Upsolver give organizations a cloud-native platform that analyzes large volumes of data at any speed – real-time or batch – automating the convoluted process of building cloud data lakes plus managing Presto. Working together enables data teams to manage complex pre-query data engineering and a distributed query engine without having deep technical knowledge. 

Upsolver provides a visual, SQL-based interface to easily create and manage tables, as well as deep integration with AWS S3, AWS Glue Catalog, and PrestoDB for high performance queries. 

We believe this partnership that provides open data lake analytics is critical to maximize the value of our customers’ data lakes. Together we accelerate time to value on data lake analytics projects and make it simple enough for any data practitioner to analyze any dataset.

This joint solution brings the power of technologies used at internet-scale companies like Facebook, Uber, and Twitter to data teams of any size. It helps companies expand beyond the cloud data warehouse to a multifaceted analytical stack, opening up a broad set of advanced use cases and greater flexibility.

We look forward to this new partnership with Upsolver and making the open cloud data lake accessible for everyone.

Additionally, Ahana and Upsolver are both part of the AWS Partner Network and premier members of the Linux Foundation’s Presto Foundation, a foundation that governs the community-driven PrestoDB project.

Ahana and Upsolver will be holding a free virtual hands-on workshop on March 23 to help users get up and running with the platform. Registration is now open.

The joint solution is available today, users can select Ahana as an output for Upsolver:

Send us a note if you’re interested in learning more! Looking forward to seeing all you do with Upsolver and Ahana 😀

Comparing AWS Athena and PrestoDB Blog Series: Athena Alternatives

This is the 4th blog in our comparing AWS Athena to PrestoDB series. If you missed the others, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits
Part 3: AWS Athena vs. PrestoDB Blog Series: Athena Partition Limits

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. Athena is great for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse. Athena (engine 2) also provides federated query capabilities, which allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Why would I not want to use Athena?

There are various reasons users look for alternative options to Athena, in spite of its advantages: 

  1. Performance consistency: Athena is a shared, serverless, multi-tenant service deployed per-region. If too many users leverage the service at the same time in a region, users across the board start seeing query queuing and latencies. Query concurrency can be challenging due to limits imposed on accounts to prevent users from overwhelming the regional service.
  2. Cost per query: Athena charges based on terabytes of data scanned ($5 per TB). If your datasets are not very large, and you don’t have a lot of users querying the data often, Athena is the perfect solution for your needs. If, however, your datasets are large and you run hundreds or thousands of queries scanning terabytes or petabytes of data, Athena may not be the most cost-effective choice.
  3. Visibility and Control: There are no knobs to tweak in terms of capacity, performance, CPU, or priority for the queries. You have no visibility into the underlying infrastructure or even into the details as to why the query failed or how it’s performing. This visibility is important from a query tuning and consistency standpoint and even to reduce the amount of data scanned in a query.
  4. Security: In spite of having access controls via IAM and other AWS security measures, some customers simply want better control over the querying infrastructure and choose to deploy a solution that provides better manageability, visibility, and control.
  5. Feature delays: Presto is evolving at an expedited rate, with new performance features, SQL functions, and optimizations being contributed by the community as well as companies such as Facebook, Alibaba, Uber, and others periodically. Amazon caught up with version 0.217 only in Nov 2020. With the current version of Presto DB being 0.248, if you need the performance, features, and efficiencies that newer versions provide you are going to have to wait for some time.

What are the typical alternatives to Athena?

Depending upon a user’s business need and the level of control desired, users leverage one or more of the following options:

DIY open-source PrestoDB

Instead of using Athena, users deploy open-source PrestoDB in their environment (either On-Premises or in the Cloud). This mode of deployment gives the user the most amount of flexibility in terms of performance, price, and security; however, it comes at a cost. Managing a PrestoDB deployment requires expertise and resources (personnel and infrastructure) to tweak, manage and monitor the deployment. 

Large-scale DIY PrestoDB deployments do exist at enterprises that have mastered the skills of managing large-scale distributed systems such as Hadoop. These are typically enterprises maintaining their own Hadoop clusters, FAANG companies (Facebook, Amazon, Apple, Netflix, Google), or tech-savvy startups such as Uber and Pinterest, just to name a few.

The cost of managing an additional PrestoDB cluster may be incremental for a customer already managing large distributed systems; however, for customers starting from scratch, this can be a significant increase in cost.

Managed Hadoop and Presto

Cloud providers such as AWS, Google, and Azure provide their own version of Managed Hadoop.

AWS provides EMR (Elastic MapReduce), Google provides Dataproc, and Azure provides HDInsight. These cloud providers support compatible versions of Presto that can be deployed on their version of Hadoop.

This option provides a “middle ground” where you are not responsible for managing and operating the infrastructure as you would traditionally do in a DIY model, but instead are only responsible for the configuration and tweaks required. Cloud provider-managed Hadoop deployments take over most responsibilities of cluster management, node recovery, and monitoring. Scale-out becomes easier at the push of a button, as costs can be further optimized by autoscaling using either on-demand or spot instances.

You still need to have the expertise to get the most out of your deployment by tweaking configurations, instance sizes, and properties.

Managed Presto Service

If you would rather not deal with what AWS calls the “undifferentiated heavy lifting”, a Managed Presto Cloud Service is the right solution for you.

Ahana Cloud provides a fully managed Presto cloud service, with support for a wide range of native Presto connectors, IO caching, and optimized configurations for your workload. An expert service team can also work with you to help tune your queries and get the most out of your Presto deployment. Ahana’s service is cloud-native and runs on Amazon’s Elastic Kubernetes Service (EKS) to provide resiliency, performance, and scalability, and it also helps reduce your operational costs.

A managed Presto Service such as Ahana gives you the visibility you need in terms of query performance, instance utilization, security, auditing, query plans as well as gives you the ability to manage your infrastructure with the click of a button to meet your business needs. A cluster is preconfigured with optimum defaults and you can tweak only what is necessary for your workload. You can choose to run a single cluster or multiple clusters. You can also scale up and down depending upon your workload needs.

Ahana is a premier member of the Linux Foundation’s Presto Foundation and contributes many features back to the open-source Presto community, unlike Athena, Presto on EMR, Dataproc, and HDInsight.

Presto vs Athena: To Summarize

You have a wide variety of options regarding your use of PrestoDB. 

If maximum control is what you need and you can justify the costs of managing a large team and deployment, then DIY implementation is right for you. 

On the other hand, if you don’t have the resources to spin up a large team but still want the ability to tweak most tuning knobs, then a managed Hadoop with Presto service may be the way to go. 

If simplicity and accelerated go-to-market are what you seek without needing to manage a complex infrastructure, then Ahana’s Presto managed service is the way to go. Sign up for our free trial today.

We also have a case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment. You can download it here.

Athena Partition Limits | Comparing AWS Athena & PrestoDB

This is our 3rd blog in the comparing AWS Athena to PrestoDB series. If you missed the first two, you can find them here:

Part 1: AWS Athena vs. PrestoDB Blog Series: Athena Limitations
Part 2: AWS Athena vs. PrestoDB Blog Series: Athena Query Limits

Partitioning is a great way to increase performance, but AWS Athena partition limits could lead to poor performance, query failures, and wasted time trying to diagnose query problems. Athena is an AWS serverless interactive service to query AWS data lakes on Amazon S3 using regular SQL. It is a low-cost service; you only pay for the queries you run. Athena engine v2 is built on an older version of Presto DB (v 0.217), and developers use Athena for analytics on data lakes and across data sources in the cloud. A common limitation that could lead to poor performance is Athena’s partitioning. 

Partitioning data

Partitioning splits your table into parts and keeps the related data together based on column values. Partitions are like virtual columns that help the system to scan less data per query. Queries that constrain on the partitioning column(s) will run substantially faster because the system can reduce the volume of data scanned by the query when using filters based on the partition. 

Athena Hive partitioning 

AWS Athena supports Apache Hive partitioning. You can partition your data by one or more columns. For example, you can partition based on time, which can lead to a multi-level partitioning scheme. It means that you then have your data organized by key values that may look like this: …/year=2021/month=01/day=01/… This is perfect if your queries constrain on year and/or month and/or day.
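
As a sketch, here is what such a partitioned table might look like in Athena DDL. The bucket, table, and column names are hypothetical:

```sql
-- Hypothetical partitioned table; data lands under paths like
-- s3://my-bucket/events/year=2021/month=01/day=01/
CREATE EXTERNAL TABLE events (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Constraining on the partition columns lets the engine skip
-- every other prefix instead of scanning the whole table:
SELECT COUNT(*) FROM events
WHERE year = '2021' AND month = '01' AND day = '01';
```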

Athena has the MSCK REPAIR TABLE command, which updates the partition metadata stored in the catalog. It is an inefficient command when there are a large number of partitions, however: the more partitions you have, the slower it runs. The command gathers metadata by scanning all the partitions in the specified table and updates the metadata in the Hive metastore accordingly so the system knows about any new partitions. It could time out after 30 minutes (Athena’s default query time limit) if it has a lot of work to do. The command should run to completion so that all the partitions are discovered and cataloged, and it should be run every time new partitions are added, e.g. after each ETL/data ingest cycle.
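
Assuming a hypothetical partitioned table named events, the usage looks like this:

```sql
-- Re-scan storage and register any newly added partitions in the
-- catalog; slow when the table has many partitions, and subject to
-- the 30-minute query timeout:
MSCK REPAIR TABLE events;

-- If you already know the new partition values, adding them
-- explicitly is much faster than a full re-scan:
ALTER TABLE events ADD IF NOT EXISTS
  PARTITION (year = '2021', month = '01', day = '02');
```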

AWS Glue partitioning 

AWS Glue is an Extract-Transform-Load (ETL) service with a central metadata repository called the AWS Glue Data Catalog. You can use it to perform ETL operations and store metadata to enable data lake querying. It also helps you to partition your data: you can create a Glue table for an Amazon S3 folder, which is a partitioned table that has key values in its path, like Hive. Automated Glue crawlers help you to keep the Glue catalog in sync with Amazon S3, which is something you can’t do with Hive.

AWS Athena partition limits

If you are using Hive metastore as your catalog with Athena, the max number of partitions per table or the Athena partition limit is 20,000. You can request a quota increase from AWS.

If you are using AWS Glue with Athena, the Glue catalog limit is 1,000,000 partitions per table.

AWS Athena alternatives with no partitioning limitations

Open Source PrestoDB

Deploying PrestoDB on your own is one way to avoid Athena’s partitioning limitations. PrestoDB doesn’t have a hard partition limit, which helps boost your performance, and you are free to deploy the latest version of Presto and so benefit from all its features.

PrestoDB has the Hive connector’s system.sync_partition_metadata procedure to update partitions in the metastore; it works better than the MSCK REPAIR TABLE command that AWS Athena uses. Because AWS Athena is built on an older version of PrestoDB, it doesn’t have this procedure.
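
In Presto, the call might look like this. The schema and table names are hypothetical; the third argument is the sync mode ('ADD', 'DROP', or 'FULL'):

```sql
-- Sync the metastore with the partitions actually present on storage.
-- 'FULL' both adds newly discovered partitions and drops stale ones.
CALL hive.system.sync_partition_metadata('default', 'events', 'FULL');
```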

Ahana Cloud for Presto

If you’re ready to take advantage of the full power of PrestoDB, you can use Ahana Cloud – the managed service for Presto in the cloud. Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes on Amazon S3 and many other data sources using PrestoDB. 

Ahana runs as a SaaS offering and gives you the ability to deploy and manage Presto at any scale without having to manage PrestoDB or the underlying containerization, and without the restrictions of Athena’s partitioning or concurrency limits. You can easily and quickly deploy multiple Presto clusters, attach data sources, and scale clusters in and out on demand in seconds. You can use Hive or Glue catalogs with Ahana Cloud, and it has its own integrated, fully managed Hive metastore for convenience.

Check out the case study from ad tech company Adroitts on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Related Articles 

5 main reasons Data Engineers move from AWS Athena to Ahana Cloud

Learn the 5 main reasons why data platform engineers decide to move their data analytics workloads from Amazon Athena.

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

Athena Query Limits | Comparing AWS Athena & PrestoDB

Welcome to the 2nd blog in our blog series on comparing AWS Athena, a serverless Presto service, to open source PrestoDB. In this series we’ll discuss Amazon’s Athena service versus PrestoDB and some of the reasons why you might choose to deploy PrestoDB on your own instead of using the AWS Athena service.

See our first blog in this series on AWS Athena Limitations.

Taking a look at AWS Athena Query Limits

Amazon Athena allows users to perform ad-hoc analytics on data lakes without the need for time-consuming Extract Transform Load (ETL) cycles. In order to maintain the availability of the service when processing such vast and diverse data, certain design decisions and limitations were introduced. Some of these are:

Inherent limits

  • Query string limit: 262,144 bytes.
  • Some statements like EXPLAIN, MERGE, and UPDATE are not supported.
  • Only one query per user is permitted, and 5 to 200 queries per account.
  • Cross-region querying is limited to sixteen regions.
  • AWS Athena partition limits: a Create Table As Select (CTAS) or INSERT INTO query can create a maximum of 100 partitions in a destination table. To work around this limitation, you must use a series of INSERT INTO or CTAS statements that each insert up to 100 partitions.
  • Query timeouts can occur on tables with many thousands of partitions. This can happen when the table has many partitions that are not of type string.
  • Athena’s concurrent query limit occurs due to limits imposed on accounts to prevent users from overwhelming the regional service.
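
As a sketch of the CTAS/INSERT INTO workaround for the 100-partition limit mentioned above (table, column, and date ranges are hypothetical; each statement creates at most 100 new partitions):

```sql
-- First batch: CTAS creates the table with up to 100 partitions.
CREATE TABLE events_by_day
WITH (partitioned_by = ARRAY['dt']) AS
SELECT event_id, payload, dt
FROM raw_events
WHERE dt BETWEEN '2021-01-01' AND '2021-04-10';

-- Subsequent batches: each INSERT INTO adds up to 100 more partitions.
INSERT INTO events_by_day
SELECT event_id, payload, dt
FROM raw_events
WHERE dt BETWEEN '2021-04-11' AND '2021-07-19';
```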

Configurable quotas

  • Queries can be queued or canceled by the Athena service when exceeding a data threshold.
  • An adjustable quota of 20 DDL (Data Definition Language, like CREATE TABLE) active queries (600-minute timeout) and 20-25 DML (Data Manipulation Language – like SELECT and CREATE TABLE AS) active queries (30-minute timeout) is permitted.
  • Maximum 20,000 partitions per table. 
  • Maximum number of databases is 100. Each database can contain a maximum of 100 tables.

Other limitations

  • Full-table scans are necessary due to the lack of indices.
  • AWS Athena’s federated queries cannot be used with views, and the required Athena Data Source Connectors (which run on AWS Lambda) are complicated to set up.
  • Nodes can run out of memory when sorting columns.
  • There is no default support for stored procedures (though it can be added with an adapter).
  • Partition creation is restricted to 100 at once (workarounds exist).
  • The same server pools run queries for multiple accounts, so queries tend to get enqueued based on the service load.
  • …and much more. You can read more details about AWS Athena’s limitations in my other blog post on the topic.

Configuring and fine-tuning Presto or working around the limitations of Amazon Athena takes a lot of experience. A solution like Ahana Cloud can help.

Ahana Cloud for Presto is a fully managed PrestoDB cloud service for AWS to query and analyze AWS data lakes stored in Amazon S3, and many other data sources. Unlike Athena, it uses the latest version of PrestoDB. Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you to reduce operational costs with its automated cluster management, speed, and ease of use. Ahana is a SaaS offering with an easy-to-use console UI. Anyone at any knowledge level can use it with ease; there is zero configuration effort and no configuration files to manage. Many companies have moved from AWS Athena to Ahana Cloud. You can try Ahana Cloud today as a free trial.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Up next: AWS Athena Partition Limits

Ahana Cloud for Presto 🌤 – Now “Open” for “Open Analytics”

I’m thrilled to share that today Ahana Cloud for Presto on AWS is generally available and open for Open Analytics. Ahana Cloud 🌤 is the first fully integrated, cloud-native managed service for Presto that simplifies the ability of cloud and data platform teams of all sizes to provide self-service, SQL analytics for their data analysts and scientists.

When we announced the Early Access for Ahana Cloud this past September, we had no idea of the phenomenal response we would get for our managed service. Before I get into the big improvements since our early release, a huge thank you 🙏🏼 to all the users that participated in the program and for all the valuable feedback particularly on making PrestoDB easier to manage and operate. And we are really sorry if we missed you during the early access. But today Ahana Cloud for Presto is available to everyone! So give it a try.

I’ve blogged and written quite a bit about open analytics this past year. After talking with hundreds of users who are building their next generation analytics platform on the data lake, I firmly believe that we are seeing an alternative architecture – an open, more flexible approach to augment the cloud data warehouse approach – where users may get locked in to proprietary data technology and formats.

Companies like Facebook, Uber, Twitter, and Alibaba adopted this loosely coupled, disaggregated stack, which enabled querying across many databases and data lakes and became the dominant standard for their analytics – with the tightly coupled data warehousing approach relegated to legacy workloads.

This new architecture is what I’m coining Open Analytics, and it’s built on Presto. And that’s why we built Ahana Cloud – with a mission of simplifying data lake analytics with Presto.

Ahana Cloud brings a SaaS open source analytics option to users with Presto at its core, using open formats and open interfaces. As the first fully integrated, cloud native managed service that simplifies the ability of cloud and data platform teams to provide self-service, Ahana Cloud brings the power of Presto to any data team of any size and skill level. And it is thrilling to see quite a bit of customer adoption just in our early access program. 

What’s new in Ahana Cloud?

With this announcement I’m pleased to share some of the latest innovations on the platform.

  • AWS Marketplace Pay-as-you-Go (PAYGO): Ahana Cloud for Presto is elastically priced based on usage, with PAYGO and annual options via AWS Marketplace, starting from $0.25 per Ahana cloud credit. Try out our pricing calculator.
  • Better logging and monitoring integration
    • Query the Presto Query Log with Presto – We decided the best way to experience the power of Presto is for us and our users to try out Presto on data generated by Presto. The Presto query log is automatically integrated into your account and stored in an S3 data lake. We now have an option to integrate this query log and attach it as an external table to an Ahana-managed Hive Metastore.
    • Presto coordinator logs and Apache Superset logs are integrated into Amazon CloudWatch for easy access and integration into your monitoring system
  • Improved security 
    • We have added mandatory password protection on all Presto clusters created and enabled HTTPS-only access. No HTTP access is allowed to the cluster. This is very critical given how easy it is to add data sources to Ahana Cloud with a few clicks.
  • Data catalog optionality 
    • The Ahana-managed Hive metastore is now optional. You can also bring your own Amazon Glue Catalog or user-managed Hive Metastore. This means lower costs so that you only pay for what you actually need and use. When a cluster is stopped and moves to an inactive state, if there is no attached Ahana-managed Hive Metastore – costs go down to ZERO. On cluster start, all cluster config including data catalogs attached are preserved giving users a one-click start. 

In addition, users continue to benefit from: 

  • Easy-to-use Ahana SaaS Console for creation, deployment and management of multiple Presto clusters within a user’s AWS account bringing the compute to user’s data 
  • Support for Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL, and Amazon Elasticsearch
  • Click-button integration for user-managed Hive Metastores and Amazon Glue
  • Apache Superset Sandbox for admins to ensure connectivity and access
  • Cloud-native, highly scalable and available containerized environment deployed on Amazon EKS

Additionally, we also announced a go-to-market solution in partnership with Intel which includes an Open Data Lake Analytics Accelerator Package for Ahana Cloud users. It leverages Intel Optane on the cloud with AWS. 

Ahana Cloud is available in AWS. You can sign up and start using our service today for free.

What’s next

Today, in under 30 minutes you can be up and running with Presto and Ahana Cloud. The combination of a cloud managed service with federated SQL analytics has so far proven to be a game-changer for our customers. 

We look forward to continuing to innovate Ahana Cloud for our customers and to deliver a seamless, easy experience for data teams looking to leverage the power of Presto. 

Sign up for Ahana Cloud

Register for our webinar we’re co-hosting with AWS: 0 to Presto in 30 minutes

Read the press release

Top 5 reasons Presto is the foundation of the data analytics stack and why you should use Presto

Ashish Tadose, Co-founder and Principal Software Engineer, Ahana

The need for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data continues to grow explosively. Data platform teams are increasingly using the federated SQL query engine PrestoDB to run such analytics for a variety of use cases across a wide range of data lakes and databases in-place, without the need to move data. PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter. Presto analytics are becoming very popular.

Let’s look at some important characteristics of Presto that account for its growing adoption.  

  1. Easier integration with ecosystem 

Presto was designed to seamlessly integrate with an existing data ecosystem without any modification needed to the on-going system. It’s like turbocharging your existing stack with an additional faster data access interface.

Presto provides an additional compute layer for faster analytics. It doesn’t store the data, which gives it the massive advantage of being able to scale resources for queries up and down based on demand.

This separation of compute and storage makes the Presto query engine extremely suitable for cloud environments. Most cloud deployments leverage object storage, which is already disaggregated from the compute layer, and auto-scale to optimize resource costs.

  2. Unified SQL interface

SQL is by far the oldest and the most widely-used language for data analysis. Analysts, data engineers and data scientists use SQL for exploring data, building dashboards, and testing hypotheses with notebooks like Jupyter and Zeppelin, or with BI tools like Tableau, PowerBI, and Looker, etc. 

Presto is a federated query engine that has the ability to query data not just from distributed file systems, but also from other sources such as NoSQL stores like Cassandra, Elasticsearch, and RDBMS and even message queues like Kafka.

  3. Performance

The Facebook team developed Presto because Apache Hive was not suitable for interactive queries. Hive’s underlying architecture, which executes queries as multiple MapReduce and Tez jobs, works very well for large, complex jobs but does not suffice for low-latency queries. The Hive project has recently introduced in-memory caching with Hive LLAP; it works well for certain kinds of queries, but it also makes Hive more resource-intensive.

Similarly, Apache Spark works very well for large, complex jobs using in-memory computation. However, it is not as efficient as Presto for interactive BI queries.

Presto is built for high performance, with several key features and optimizations, such as code generation, in-memory processing, and pipelined execution. Presto queries share a long-lived Java Virtual Machine (JVM) process on worker nodes, which avoids the overhead of spawning new JVM containers.

  3. Query Federation

Presto provides a single unified SQL dialect that abstracts all supported data sources. This is a powerful feature that eliminates the need for users to understand the connections and SQL dialects of the underlying systems.
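
To make this concrete, here is a sketch of a federated Presto query. It joins clickstream data stored as files on Amazon S3 (exposed through a Hive catalog) with customer records living in MySQL, using only catalog-qualified table names. The catalog, schema, table, and column names are illustrative assumptions, not taken from the article:

```sql
-- Join an S3-backed data lake table (hive catalog) with an
-- operational MySQL table (mysql catalog) in one Presto query.
SELECT c.customer_name,
       count(*) AS page_views
FROM hive.web.clickstream AS v      -- Parquet files on Amazon S3
JOIN mysql.crm.customers  AS c      -- rows in a MySQL database
  ON v.customer_id = c.id
WHERE v.event_date >= DATE '2021-01-01'
GROUP BY c.customer_name
ORDER BY page_views DESC
LIMIT 10;
```

Each source keeps its own dialect and storage format; Presto’s connectors translate the relevant parts of the query into whatever the underlying system understands.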

  4. Design suitable for cloud

Presto’s fundamental design of running storage and compute separately makes it extremely convenient to operate in cloud environments. Since the Presto cluster doesn’t store any data, it can be auto-scaled depending on the load without causing any data loss.

As you can see, Presto offers numerous advantages for interactive ad hoc queries. No wonder data platform teams are increasingly using Presto as the de facto SQL query engine to run analytics across data sources in place, without the need to move data.

—————–

To learn more about Presto, listen to the Tech Talk Series: Getting Started with Presto on demand at your convenience.

Building Managed Services: Architecting Ahana Cloud for Presto with the In-VPC Deployment Model

Ahana Cloud for Presto

Gary Stafford, Solutions Architect, AWS & James Mesney, Solutions Engineer, Ahana

Note: This article originally appeared on the AWS Startups blog.

Ahana is the startup that provides the first cloud-native managed service for Presto, the fast-growing, open source distributed SQL engine. Backed by GV (formerly known as Google Ventures) and Lux Ventures, the Ahana team includes experts in Presto, AWS, and big data. This blog post discusses how AWS users have evolved their big data requirements and how the team architected our managed service offering, highlighting the best practice of providing an “In-VPC” deployment. We hope other infrastructure software startups can benefit from sharing some of the key learnings that led to the launch of Ahana Cloud for Presto on AWS.

For some more background, Presto is an open source system for federated data analytics. Federation means the system can span multiple data stores. It enables users to access data where it lives, in a wide variety of sources, via federated plug-in connectors, without moving or copying the data. Presto was originally developed by Facebook. Today, it’s deployed in large-scale production at some of the world’s most data-driven companies, including Uber and Twitter. Presto addresses the business need of leveraging all data within an organization to generate insights and drive decision-making faster than ever before. Presto also leads in delivering on the technology trends of today: the disaggregation of storage and compute, which has driven the rise of Amazon S3-based data lakes and on-demand cloud computing. You can learn more on the AWS Presto page.

While the SQL engine is the main component of an interactive ad hoc analytics system, the other components – the metadata catalog, the data sources, and the visualization tools or notebooks – require integration. Deploying and managing complex software in AWS can be challenging. Presto administrators must set up, configure, and connect one data store for Presto’s metadata; typically, this is Apache Hive or AWS Glue. They must also create and configure connectors to access their data sources and then configure catalog entries for each data source. Presto requires admins to manage many properties files to achieve this, which is both laborious and error-prone.
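
As an illustration of what that hand-managed configuration looks like, a self-run Presto deployment typically carries one properties file per catalog under `etc/catalog/`, and changing any of them usually means a cluster restart. The hostnames and credentials below are placeholders, not real endpoints:

```properties
# etc/catalog/hive.properties — Hive metastore over S3 data
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.internal:9083

# etc/catalog/mysql.properties — a MySQL database (e.g., Amazon RDS)
connector.name=mysql
connection-url=jdbc:mysql://mydb.example.us-east-1.rds.amazonaws.com:3306
connection-user=presto
connection-password=<secret>
```

Multiply this by every data source, every environment, and every cluster, and the appeal of having catalogs created and shared through a managed console becomes clear.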

Ahana Cloud for Presto addresses these complexities and more with an easy-to-use cloud-native managed service. In 60 minutes or less, Ahana allows users to build an end-to-end deployment: multiple clusters of Presto, their Glue or Hive metadata catalogs, their AWS data sources, and user-facing tooling. Customers get the power of Presto with the capabilities of AWS for faster, more iterative, and interactive data discovery—without the complexity. Analysts, data engineers, and data scientists enjoy the freedom to rapidly use any data in the organization and do so in a more self-service way. Additionally, AWS customers can procure services the way they’re used to—quickly and easily—on an hourly pay-as-you-go (PAYGO) listing on AWS Marketplace, simply billed to their AWS accounts.

The “In-VPC” Deployment Approach that Data-Driven Customers Want

As cloud service adoption has grown, the way companies store and analyze their data has evolved. Early adopters were focused on innovation: building and deploying applications quickly with AWS and other public cloud providers. Most of the mission-critical data was still produced and analyzed in data centers, mainly due to concerns about control of the data: where that data could be copied, how it could be used, and who could access it. Now, as cloud adoption has become mainstream, we see companies with the majority of their data both created and stored in the cloud, especially in cost-efficient Amazon S3-based data lakes. Along with this shift, the concerns related to how and where data is sent, its use, and access controls have evolved as well. Users do not want to lose control of their data; they prefer not to have to ingest it into other environments. They want data to remain in their own Virtual Private Cloud (VPC).

A new cloud-native architecture model has emerged for data-focused managed services like Ahana. We call it the “In-VPC” deployment model, separating the control plane from the compute and data planes.

The Role of the Control Plane

Ahana Cloud has two major “planes”: the control plane, which is delivered as SaaS, and the compute plane, where Presto clusters run, which is delivered as a managed service. The Ahana Control Plane, just as it sounds, oversees, orchestrates, and manages the rest of the environment. The control plane runs in its own VPC, in the Ahana account, separate from the customer account VPC where the compute plane and data live. This makes management much easier, without customers having to share control of user data with Ahana. This is important, as users want their data to remain in their own VPC and not be ingested into any other environment (e.g., some first-generation cloud data warehouse services). In fact, the Ahana control plane running in the Ahana VPC never sees any of the customer’s data; it is totally separate from the customer’s “In-VPC” compute plane deployment.

Integrated Metastore

For further ease of use, Ahana pre-integrates an Apache Hive metastore/catalog, which is created automatically, so it’s not essential to set up other components like AWS Glue. But if users have an existing metastore, including Glue, they can use that instead.

Connectors Included

In terms of connectors, Ahana initially ships with support for AWS data services like Amazon S3 and Amazon RDS for MySQL and PostgreSQL, and others. More connectors for sources like MongoDB and Amazon Redshift will follow soon. Ahana automates the creation of connections and catalogs, removes the need to juggle configuration files, and eliminates the need for Presto restarts. Catalogs can be created once and used by multiple clusters.

Presto Ahana In VPC model diagram

In the diagram, there are two core components, both created and managed by Ahana Cloud:

1. The Ahana control plane (top) and its UI orchestrate the Presto environment. There’s consolidated application logging, query logging, and monitoring, which gives users full and easy management and control. There are security and access controls, pay-as-you-go hourly billing, and support.

  • The control plane runs in the Ahana Amazon account, external to the user’s environment.
  • Ahana and its employees have no access to the user’s data.
  • It is multi-tenant to scale with customer accounts.
  • The control plane supports SSO with Amazon Cognito, LDAP authentication, and SQL-based authorization for Presto (RBAC). In the future, there will be Apache Ranger support.

2. The Ahana compute plane (bottom) runs in each user’s VPC, deployed as a single-tenant environment within the user’s account. The control plane first creates a dedicated VPC for the compute plane. It then deploys Amazon EKS for a highly elastic, highly available environment to create Presto clusters. Once the control plane completes the initial set-up of the compute plane, users can create and manage any number of Presto clusters, which then get provisioned into the compute plane in Amazon EKS.

  • The compute plane, and the user data it interacts with, runs in the user’s account.
  • Each cluster is created in an individual node group to utilize the most advanced autoscaling and high-availability capabilities EKS provides.
  • Each Presto cluster comes pre-integrated with a Hive Metastore to store metadata for schemas and tables generated via Presto and an Amazon S3 data lake where data inserted into tables gets stored.
  • In addition to the pre-integrated catalog and Amazon S3 bucket, users can attach external Hive Metastores or AWS Glue catalogs pre-populated with metadata for structured data stored in Amazon S3 and databases running on Amazon RDS for MySQL or PostgreSQL.
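
The pre-integrated catalog and S3 bucket mean standard Presto DDL and DML work out of the box. The sketch below assumes the built-in Hive catalog is named `ahana_hive` (the actual name may differ per deployment); schema, table, and column names are illustrative:

```sql
-- Create a schema and table in the pre-integrated Hive catalog;
-- rows inserted into the table land as files in the attached S3 bucket.
CREATE SCHEMA ahana_hive.demo;

CREATE TABLE ahana_hive.demo.events (
    event_id   bigint,
    event_type varchar,
    event_date date
) WITH (format = 'PARQUET');

INSERT INTO ahana_hive.demo.events
VALUES (1, 'signup', DATE '2021-03-01');
```

Because the metastore and bucket are provisioned with the cluster, none of the usual metastore wiring is required before running these statements.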

This separation of the control, compute, and data planes is enabled by Amazon’s recommended approach of cross-account access via external ID – a mechanism that uses trusted secure token exchange. Users simply update their policy to include the Ahana ARNs (Amazon Resource Names).  The In-VPC deployment approach offers greater security and cleaner management to users. For further details, we recommend this AWS blog on architecting successful SaaS services.
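
In general AWS terms, that cross-account trust is expressed as a role trust policy in the customer’s account. The sketch below shows the standard external-ID pattern; the account ID and external ID placeholders are assumptions that would in practice be supplied by Ahana:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::<AHANA_ACCOUNT_ID>:root" },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": { "sts:ExternalId": "<EXTERNAL_ID_FROM_AHANA>" }
    }
  }]
}
```

The external ID condition ensures the role can only be assumed by the intended third party on the customer’s behalf, which is what keeps the cross-account access scoped and auditable.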

Summary

Ahana Cloud for Presto is the fully managed, end-to-end environment for Presto. It gives users an interactive multi-cluster UI with single-click cluster and data source management. It provides automatic set-up, security features, and resilience features. It leverages the “In-VPC” deployment, which separates the control, compute, and data planes for customers. Finally, Ahana is procured using a simple and affordable pay-as-you-go usage-based licensing model on AWS.

Sign up for free Early Access for Ahana Cloud today!

Ahana Cloud for Presto: Analytics for Disaggregated Stack

Today I am incredibly excited to announce Ahana Cloud for Presto, the first fully integrated, cloud-native managed service for Presto, which simplifies the ability of cloud and data platform teams of all sizes to provide self-service SQL analytics for their data analysts and scientists. Before I share more about what Ahana Cloud for Presto does, I’d like to share why we built it and the problems it solves.

Data warehousing emerges

Data warehousing emerged in the 90s as the internet was surging in popularity and data was burgeoning. The new competitive, constantly-changing global economy required greater business intelligence, and there was a broad realization that data needed to be integrated to provide the critical business insights for decision-making. Teradata, Oracle, Microsoft SQL Server and IBM DB2 warehouses exploded. That requirement for greater business insights has only grown since and Snowflake’s tremendous IPO last week showed how data warehousing has evolved and now moved to the cloud. 

ingest everything into a data warehouse

The first step of data warehousing involves continuously ingesting all data into a single database.

Once data is in the data warehouse, you can query it and report on it. Typically, these systems are closed source, with data stored in proprietary formats. Because of the technology and data lock-in, these systems are also very expensive. 

An alternative architecture arises – the open federated, disaggregated stack 

Over the past five years, while the traditional data warehousing approach of a tightly coupled database continued to be adopted, an alternative approach gained wide adoption among the most innovative technology companies – Facebook, Twitter, Uber, Netflix, and others. A loosely coupled, disaggregated stack that enabled querying across many databases and data lakes became the dominant standard for their analytics, with the tightly coupled data warehousing approach relegated to legacy workloads.

This new SQL analytics stack is made of four elements – the query engine, metadata catalog, transaction manager, and storage engine – and Presto has emerged as the de facto query engine. Presto is a federated, distributed query engine created and open sourced by Facebook. It is designed to be extensible and pluggable, which led to its extensive connector ecosystem.

But why? Why is this disaggregated stack, with Presto as its foundation, the preferred choice of the most advanced technology companies – companies that could easily afford to buy products off the shelf?

First, this federated, disaggregated stack addresses the new realities of data

  • There is just too much data being generated and a single database is no longer the solution to support a wide range of analytics 
  • Data will be stored in data lakes, but other pertinent data will still reside in a range of other databases
  • SQL analytics is needed for both the data lake where raw data resides in cheap storage as well as the broad range of other databases data continues to live in

Second, this federated, disaggregated stack is open 

Open source – PrestoDB under the Linux Foundation is completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community. 

Open formats – PrestoDB doesn’t use any proprietary formats. In fact, it supports most of the common formats like JSON, Apache ORC, Apache Parquet and others.

Open interfaces – PrestoDB is ANSI SQL compatible. Standard JDBC / ODBC drivers can be used to connect to any reporting / dashboarding / notebook tool. And because it is open source, language clauses continue to be added in and expanded on. 

Open cloud – PrestoDB is cloud agnostic; because it runs as a query engine without its own storage, it aligns naturally with containers and can be run on any cloud.

Technology companies prefer this open approach compared to the proprietary formats and technology lock-in that come with the traditional data warehousing approach.

Why isn’t the open federated, disaggregated stack with Presto ubiquitous? 

This is the question I asked myself over the past year. As I talked with hundreds of data engineers and platform engineers, it became crystal clear. 

The power of Presto is fantastic, but it is still out of reach for many platform engineering teams who may not have the time or skills required to manage Presto. Born in the Hadoop world, Presto is still complex. It’s a distributed data system that requires extensive configuration, tuning, integration, and management. Managing it on top of containers and systems like Kubernetes makes it even more challenging. While some companies – particularly large Internet ones – enable self-service SQL analytics across many data sources, including both data lakes and databases, many others have not yet been able to do so, given the complexity of these activities. This is what Ahana Cloud solves.

Introducing Ahana Cloud 

Ahana Cloud for Presto–the first fully integrated, cloud native managed service–simplifies the ability of cloud and data platform teams to provide self-service, SQL analytics for an organization’s analysts and scientists.

Easy

0 to Presto in 30 minutes, including the in-VPC AWS service in your own account, with the Ahana Console

Fully integrated 

Ahana comes with a built-in catalog and easy integration with data sources, catalogs and dashboarding tools

Cloud Native

Runs on Amazon Elastic Kubernetes Service for high scalability, availability and manageability

Learn more about Ahana Cloud for Presto here

Just the beginning 

This is just the beginning for us at Ahana. The combination of a cloud managed service with federated SQL analytics has opened up an enormous set of possible innovations to simplify analytics for platform teams – with the eventual goal of being self-managed and self-healing.  I am excited that Ahana makes the power of Presto widely accessible and achievable to data platform teams of every size and, at the same time, contributes back to the community and open source. 

Are you ready to go from 0 to Presto in 30 minutes? 

Sign up now to get started

Bridging the gap to bring two Presto communities together: Welcoming Starburst Data to the Presto Foundation

Steven Mih, Co-Founder & CEO

This week at Ahana we announced our company launch and vision to further grow and evangelize the PrestoDB community alongside the Linux Foundation and Presto Foundation with founding members Facebook, Uber, Twitter, and Alibaba. Also this week, Starburst Data shared a blog announcing that they joined the Presto Foundation. 

As news of Ahana and PrestoDB circulated in outlets like ZDNet, Datanami, and many more, we kept hearing the same question come up, one that community members like Thomas Spicer at OpenBridge asked in his recent blog: Why are there two Presto projects and how many do we need?

To provide context, today there are two separate Github repos, two Slack channels, two websites for Presto, and two foundations. First, there’s the original PrestoDB with Linux Foundation’s Presto Foundation. Second, there’s the similarly-named fork PrestoSQL with Presto Software Foundation, which was started and controlled by the new co-founders of Starburst Data. Whoa. 

You may be thinking, “wow, what a hot mess!” And you wouldn’t be alone. I’ve talked with many developers who feel similarly and just want to code without all the confusion! Gale Hashimoto & Chiara Portner at Hopkins & Carley recently blogged about open source project naming conventions and how they often advise developers.

Fortunately, the situation is looking like it will soon be resolved.  

The Linux Foundation is one of the most experienced organizations at helping bring developer communities together. Arguably, the Linux Foundation is the spiritual center of open source, along with the Apache Software Foundation. And the Linux Foundation has achieved unity numerous times – some examples being Linux itself and container image formats via the Linux Foundation’s CNCF.

While some of this may feel like “yawn-inducing inside-baseball”, in my view it matters a whole lot. Just look at what the Linux and Kubernetes projects have achieved for the greater good of developers worldwide. 

Transparency is one of the key tenets of the Linux Foundation’s Presto Foundation (see its three main principles in the image below). Underlying the principle of a united community is the idea that software development needs to thrive, and that isn’t the case when efforts are duplicated across multiple project code bases.

Since late December of last year, I’ve been aware of many meetings and continuing efforts between the Linux Foundation’s Presto Foundation and the Presto Software Foundation to align with the above principles. Over the last few days, however, a breakthrough in bridging these two communities has occurred. While we won’t know the reasons for some time, I suspect that Ahana may have been the catalyst.

I look forward to welcoming Starburst to the Presto Foundation and for the benefit of the community, I hope we can see the confusion end with: 1 foundation, 1 primary code base for new development, and 1 community. 

Getting Perspective on Open Source Projects and Forks with Ted Dunning, Apache Software Expert

We spent some time with Apache Software Foundation board member, PMC member, and open source expert Ted Dunning to get his thoughts on the state of open source software. 

Can you start by telling us a bit about yourself and your open source background?

I’ve been involved with machine learning and big data systems for a very long time, and have spent even longer working in the open source world. The first open source I was involved with was in the mid-’70s, which at the time meant guys trading floppy disks in a coffee shop parking lot in Colorado. More recently, I’ve been with the Apache Software Foundation (ASF) since 2007.

What’s your general philosophy on open source and the community behind an OSS project?

In my view, it’s not just one part or a few individuals that create an open source community – it’s the community as a whole. Sure, you might have a few more vocal folks, but an open source community is only as good as its least vocal member. And the community is a great thing because it gives people the ability to be part of a community – a wonderfully electrifying experience. It’s a unifying experience, and it is the ultimate marketing. 

There are cases where people pretend to be open but won’t really let anyone else play. If a community is built cynically, it can turn into vast bitterness. And it’s very common for companies to say “we employ the creators of this project,” but in fact I think it is the community in aggregate who wrote it. Yes, a few people may have written the first lines of code and may have gotten the ball rolling, but we have to remember that the project is not just the source code. There are always a lot of different contributions to the creation of a project.

And you’ve been a part of several open source companies, MapR being the latest as the CTO. How did you strike the balance of open source and proprietary?

MapR was actually very, very interesting because it had open and closed parts. But they were good about finding a balance – it wasn’t just all proprietary software or all open source. I tried to be very publicly clear about our decisions about open sourcing some features and keeping others proprietary. One of the challenges of a vendor-backed open source technology is that you’re going to have a culture gap inside the company. You have people come from closed source, and you have people that come from open source, and neither side necessarily understands the other. But despite that, I felt much better being straightforward and honest about our decisions rather than pretending that everything was open.

When it comes to building open source communities around projects, I’m sure you’ve seen many times that things can go wrong. In some cases that might mean a project splits off, or forks, from the original project. In your experience, what can that mean for a community?

It’s so sad when communities split up. When these destructive forks happen, quite frankly it’s a huge tragedy for open source. The common characteristics of these cases are that a) everyone is just confused and b) there typically are major differences in worldviews of the same events on either side of the fork, making it very hard to get people together when there’s no shared understanding of what was happening. I see these two specific things happen when it comes to bad forks:

  1. Both projects keep the same name or too similar of a name, so no one knows the differences between the two
  2. Splitting the project splits the community, and this may render both irrelevant

If you split a community – a mailing list, a Slack channel – there’s a reverse synergy effect. A lot of good people who want to contribute to something but don’t really like conflict end up just leaving. People have plenty of distractions in their lives. And they have plenty of other opportunities to be creative, fun, and happy. Then like 10 years later they go, “Oh, I remember that, wonder what happened to it?”

What’s a specific example of a fork that went wrong? 

Sun’s OpenOffice/LibreOffice project split was a classic fork that split the community. As Sun’s contributions to the project declined, there were concerns from the OpenOffice community over whether it would remain under a neutral foundation. When Oracle acquired Sun, discussions of a fork began; as a result, LibreOffice was created and the community was split between the two projects. Unfortunately, this created a lot of conflict and negativity, which harmed both products and the OpenOffice/LibreOffice community at large. Much of the community stopped contributing to and maintaining either.

The Hudson/Jenkins fork is a fork that had a similar potential for problems, but essentially the entire community left the original Hudson branch and moved over to the Jenkins fork. Today, it is hard to see that there ever was a fork. Things that went right included distinctive naming (Hudson versus Jenkins) and a strong consensus about which fork to stay with moving forward (Jenkins). The original Hudson project moved to the Eclipse foundation, but has been dormant since 2017, leaving room for Jenkins to flourish.

What about projects that keep too similar of a name? We’re seeing that play out today with the Presto open source project with the original PrestoDB and the similarly-named fork PrestoSQL. What are your thoughts on this?

My initial reaction with Presto is that I always have to go look up which one is which. There’s PrestoSQL, PrestoDB, Presto Foundation, Presto Software Foundation and I can’t keep them straight. And so there’s massive confusion unless you’re one of the participants – as an outsider, I have no idea which one is which. From what I understand about the project, my guess is that the Presto creators who worked at Facebook didn’t make it apparent that they were going off to start a new community. They may not have been as open as they could have been about this, as sometimes happens when people leave a company. Later Facebook decided to put more effort behind the community, and it caused confusion because there’s now effectively two communities with nearly identical names.

But I think that the moral of the story is that when you name something, names matter. Anybody who forks should have a completely different name. That alone makes it easier for everyone to understand. Of course, it isn’t always clear which is the fork and which is the original, but at least you’d know if it was Presto in one place and Abracadabra in the other. Everybody would get it. It doesn’t even matter all that much which side gets which name.

Introducing Ahana

Dipti Borkar, Co-Founder & CPO and Steven Mih, Co-Founder & CEO

There are many factors that contribute to strong co-founder relationships. Forbes refers to one set and defines them as the 3Ts: the first being ‘trust’, which is self-explanatory; the second being ‘talk’, the ability to talk things out even if you disagree; and the third being ‘target’, having one mission with aligned goals. While it’s easy to put this into a framework, getting it right is rare. For us, it took nearly 10 years, joint experiences at two companies, and working through numerous tough situations and tough conversations. The outcome is Ahana.

While we’re heads down working on Ahana, we’d like to share our vision and the problem we intend to solve. Our vision is to simplify the interactive, ad hoc analytics journey for users. But hasn’t this problem been solved already? Let’s walk through the evolution of data systems for a minute or two. 

An architectural shift 

It has been nearly 15 years since the Google MapReduce paper was published (2004) and Amazon Web Services was launched (2006). These were key milestones that marked the beginnings of two of the biggest trends in enterprise software, trends that continue even today. In these 15 years a lot has changed. Data is stored in many more places than the three early relational databases (IBM DB2, Oracle, Microsoft SQL Server). In fact, with polyglot persistence, organizations have different data systems for each use case. This also means that the metadata, which used to be unified as the star schema of the sacred data warehouse, is now spread across a variety of data sources.

In addition, the five key components of a database as defined in database textbooks are now independently running pieces of software in the big data stack. The database stack is completely disaggregated – the query processor (examples: Apache Hive, Apache Spark, Presto), the metadata catalog (examples: Hive Metastore, AWS Glue), the storage engine (examples: object stores, AWS S3, Google GCS, Azure ADLS, other RDBMSes) and even the transaction manager (examples: Apache HUDI, DeltaLake). This separation brings great flexibility at the cost of tremendous complexity. 

Source: Textbook: Architecture of a Database System, Hellerstein, Stonebraker, Hamilton

The largest internet companies have some variation of these components in production at massive scale. Big platform teams with the brightest engineers work on integrating these components, innovating on them, deploying them, and maintaining them. And while it is complex and resource-intensive, they get tremendous value by enabling data-driven decision-making with ad hoc analytics platforms.

But every company and organization should have the ability to make interactive, ad hoc data-driven decisions without the need to integrate, manage and deploy a complex stack.

Data Federation makes a comeback

The heart of the data stack is the query processor, the query engine. And the good news is that there has been immense innovation in the open source community on modern standalone query engines. In fact, the idea of a separated query engine isn’t new. Data Federation has had many iterations over time.  

Data Federation 1.0 started with the ACM federated architecture paper by McLeod and Heimbigner (1985). History rhymed again with Data Federation 2.0 in the early 2000s, with the founding of Composite Software and the publication of the Garlic paper on DB2 federation (2002). But there were still too few data sources to query against, and the stack wasn’t disaggregated, which reduced the need for federation.

Data Federation 3.0 made a comeback with Google’s Dremel paper in 2010. Couchbase, where we both worked for many years, implemented SQL++, another federated query engine designed at the UC San Diego database lab. Presto was designed and built at Facebook and open sourced in 2013. Since then, its adoption has simply exploded. 

Facebook and the creators of Presto – Martin, David, and Dain, who were engineers there – built a nicely designed distributed system (Presto IEEE paper): highly modular, pluggable, and extensible. While there is room for improvement (like moving away from a single coordinator node, better resource management, and a more advanced planner that reads less data and pushes down more work), Presto has become the de facto choice for interactive, ad hoc querying on a disaggregated stack.

Community-driven Presto and Presto Foundation

We have both learned from our enterprise software experience that developers and engineers primarily care about solving problems in order to get things done. As the use of open source software grows in importance to an organization, the developers also come to care about a project’s transparency, openness, and neutrality. Open source software has moved from its fringe beginnings to the forefront of technology innovation. While both of us have participated in the evolution of open source commercialization – from support-only models to open-core proprietary subscription models, even extending to infrastructure software as a service – the common underlying factor in all these models is the requirement of a strong, vibrant open source community. Meeting that requirement takes more than an open source license like Apache 2.0. It takes a thoughtful, transparent, and authentic approach in all interactions. We believe that is open source done right.

At Alluxio, where we both worked recently, we became very involved with the open source Presto community. Dipti presented many joint talks with founders of Starburst Data as well as pushed out product offerings integrated with Starburst Presto. Steven engaged with many end user companies to evaluate and deploy Presto with Alluxio into production environments.

We soon realized that there was a lot more to the Presto community than we knew. There were in fact two separate GitHub repos, two Slack channels, and two websites – all very, very confusing for any open source community. Then in September 2019, Facebook donated the original project, PrestoDB, to the Linux Foundation to further grow and evangelize the community under an established open source governance model similar to CNCF and Kubernetes. We joined the Presto Foundation (Steven as a member of the Governing Board and Dipti as chairperson of the Outreach Committee) to evangelize and support Presto.

A new adventure begins

We had talked about founding a company a few times in the past, and seeing the ever-growing problems with disparate data systems, combined with the federated query engine returning to the forefront, we have embarked on a new venture. We believe that data federation 3.0 with Presto will become the architectural foundation to meet the needs of modern data teams. 

So here we are. We are excited to share that Ahana has raised $2.25 million in funding led by GV (formerly Google Ventures), with participation from Leslie Ventures and other angel investors. We’re thrilled to have Dave Munichiello from GV as our lead investor. He and the whole GV team have been fantastic in their support of our vision. We are excited to build out our technical team and deliver simplified Presto-based analytics products for every organization. Stay tuned by joining our All Things PrestoDB newsletter.

We’ve worked together at two companies, we have a great working relationship, and we’re passionate about bringing open source products to market. This time, we’re thrilled to start from the ground up, as friends and as co-founders of Ahana. Cheers!

San Francisco, April 2020 – Dipti Borkar and Steven Mih celebrating the closing of the Ahana seed round in Steven’s garage office with the leaf blower