Presto SQL Syntax: Learn to Write SQL Queries in Presto

Presto is powerful, but running it on your own can get complicated. If you’re looking for a managed Presto experience that can let you focus on querying your data rather than managing infrastructure, try Ahana Cloud today.

PrestoDB uses regular ANSI SQL to query big data stored in object storage. If you’ve used SQL in databases before, you should be very comfortable using Presto. However, there are some quirks to keep in mind, stemming from the fact that Presto is typically used to query semi-structured storage such as Amazon S3 rather than relational databases.
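For example, querying an S3-backed table through Presto’s Hive connector is plain ANSI SQL; the catalog, schema, table, and column names here are illustrative:

```sql
-- Query an S3-backed table registered in the Hive connector.
-- 'hive', 'weblogs', 'page_views' and the columns are hypothetical names.
SELECT
  url,
  count(*) AS hits
FROM hive.weblogs.page_views
WHERE event_date >= DATE '2022-01-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

The same query would run unchanged against most relational databases; the difference is that here the underlying data lives as files in S3.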

Below you’ll find some of our most popular resources relating to writing Presto SQL.

Working with Date and Time Data

Working With Different Data Types

Manipulating Tables and Data

Additional Presto SQL Resources


Top 4 Amazon Redshift Alternatives & Competitors

Introduction

In the last article we discussed the fundamental problems with Amazon Redshift. To add to that article, we’ll provide some information about where to look if you’re starting to explore new options. We’ll walk through some of the available Redshift alternatives and why they are worth looking into. Disclaimer: the list is not ranked in any particular order.

1. Ahana

Ahana Cloud offers the only managed service for Presto, a feature-rich, next-gen SQL query engine. It plays a critical role for data platform users searching for an easy-to-use, fully integrated, cloud-native SQL engine for their AWS S3 data lakes, as well as other data sources. Ahana Cloud has everything the user needs to get started with SQL on the Open Data Lakehouse. It’s a great choice as a Redshift alternative, or even to augment the warehouse.

Currently, Ahana offers a free trial for its enterprise solution, as well as a free community edition.

2. BigQuery

BigQuery is another great Redshift alternative. It’s a cloud data warehouse that ingests and processes queries at scale on Google Cloud Platform. If you’re on Google Cloud, it doesn’t require much effort to integrate it with other Google products.

You can run queries and analyze terabytes of data in seconds. BigQuery lets you leverage the power of Google’s infrastructure to load data. You can also use Google Cloud Storage to bulk load your data, or stream it in bursts of up to a thousand rows per second.

It’s supported by the BigQuery REST API that comes with client libraries like Java, PHP, and Python. While BigQuery is the most proven tool on this list, it’s not the easiest to use. If your team lacks an experienced data engineer, you’re going to have challenges as the learning curve is significant.

BigQuery pricing: queries are billed by the amount of data processed, at $5 per TB (with one free TB per month).

3. Azure SQL Data Warehouse

As a Redshift alternative, Azure is a good choice. Azure SQL Data Warehouse is well suited to large businesses dealing with consumer goods, finance, utilities, and more. One of the most used services on Microsoft Azure, it’s SQL Server in the cloud, but fully managed and more intelligent.

Now absorbed into Azure Synapse Analytics, it’s a powerful cloud-based analytics platform you can use to design the data structure immediately (without worrying about potential implementation challenges). Its provisioned resources also allow users to query data quickly and at scale.

If you’re not familiar with the Azure environment, you’ll have to invest some time in understanding it. As it’s fully featured and well documented, there’s enough support to get you over the learning curve.

Like Redshift and Snowflake, Azure Synapse also follows a consumption-based pricing model. So, it’s best to have an experienced data engineer on-board to make “reasonably accurate guesstimates” before committing.
Azure pricing: follows an hourly data consumption model (and offers a 12-month free trial).

4. Snowflake

Like Redshift, Snowflake is a robust cloud-based data warehouse built to store data for effortless analysis. Snowflake is a good Redshift alternative developed with experienced data architects and data engineers in mind; it leverages a SQL workbench and user permissions to allow multiple users to query and manage different types of data.

Snowflake also boasts robust data governance tools, security protocols, and the rapid allocation of resources. While the platform is powerful and efficient at managing different data types, it still proves to be a significant challenge for users who don’t hail from a strong data background.

Snowflake also lacks data integrations, so your data teams will have to use an external ETL to push the data into the warehouse. Whenever you use third-party tools, you’ll also have to consider the extra costs and overheads (such as setup and maintenance costs) that come with them.

Snowflake follows a consumption-based pricing model similar to that of Redshift. This is great for experienced users who can make an educated guess about their data consumption. Others may face an unpleasant surprise at the end of the billing cycle. For a more in-depth look at Snowflake as a competitor, check the Snowflake breakdown.

Snowflake pricing: based on a per-second data consumption model (with an option of a 30-day free trial).

Test out the Alternative

Ready to see the alternative in action?

Using a Managed Service for Presto as a Redshift Alternative

Redshift, while a fantastic tool, has some significant issues the user will have to overcome. Below are the most frequently cited concerns expressed by Amazon Redshift users, and the catalysts for seeking out Redshift alternatives:

Price-Performance

Redshift gets expensive quickly. As data volumes increase, the cost of storage and compute in the warehouse becomes problematic. Redshift comes with a premium cost, especially if you use Spectrum outside of AWS Redshift. A solution is to reach for a tool focused on reducing overhead cost. As a biased example, Ahana Cloud is easy to run and lets users pay only for what they use, with no upfront costs: the performance you’re used to, for less.

Closed & Inflexible

Working with a data warehouse, while it has some perks, comes with drawbacks. In this environment the user loses flexibility. Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse; Redshift does not support flexible or open data formats.

Other modern solutions allow the user to define and manage data sources. Ahana lets data teams attach or detach data sources from any cluster with the click of a button, and also takes care of configuring and restarting the clusters.

Vendor Lock-in

One of the biggest pain points, and a major driver in the search for Redshift alternatives, is vendor lock-in. Data warehouse vendors, like AWS Redshift, make it difficult to use your data outside of their services. To do so, data would need to be pulled out of the warehouse and duplicated, further driving up compute costs. Use the tools and integrations you need to get value from your data, without the proprietary data formats. Head over to this additional comparison for a solution to vendor lock-in, price-performance, and inflexibility.


Summary: Redshift Alternatives

If you are using Amazon Redshift now and are looking to solve some of the problems with it, check out this on-demand webinar on augmenting your Redshift warehouse with an Open Data Lakehouse. The webinar also explains why so many of today’s companies are moving away from warehouses like Snowflake and Amazon Redshift towards other Redshift alternatives: specifically, a SQL Data Lakehouse with Presto.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article


The Fundamental Problems with Amazon Redshift

In the last article we discussed the difference between the Redshift data warehouse and Redshift Spectrum. To continue on this topic, let’s look at the problems with Amazon Redshift and some of the available alternatives that data teams can explore further.

Amazon Redshift made it easy for anyone to implement data warehouse use cases in the cloud. However, it cannot provide the same benefits as newer, more advanced cloud data warehouses. When it was a relatively new technology, everyone was going through a learning curve.

Here are some of the fundamental problems with Amazon Redshift:


AWS Redshift’s Cost

Amazon Redshift is a traditional MPP platform where compute is closely integrated with storage. The advantage of the cloud is that, in theory, compute and storage are completely independent of each other, and storage is virtually unlimited. With this data warehouse, if you want more storage you have to purchase more compute power. As data volumes increase, the cost of storage and compute in the warehouse becomes challenging. AWS products, particularly Redshift and Spectrum, come with a premium cost; this is especially true if you use Spectrum outside of AWS Redshift. The result is one of the most expensive cloud data warehouse solutions.

Vendor lock-in with Redshift

Data warehouse vendors, like AWS, make it difficult to use your data outside of their services. Data would need to be pulled out of the warehouse and duplicated, further driving up compute costs.

Proprietary data formats

Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse. No flexible or open data formats are available.

No Staging Area in Redshift

It is expensive to host data with Amazon, so duplication of data has to be avoided at all costs. In traditional RDBMS systems, we tend to have landing, staging, and warehouse layers in the same database. But for Amazon’s data warehouse, the landing and staging layers have to be on S3. Only the data on which reports and analytics will be built should be loaded into Redshift, and only on an as-needed basis, rather than keeping the entire dataset in the warehouse.

No Index support in Amazon Redshift

This warehouse does not support indexes like other data warehouse systems, so it performs best when you select only the columns you absolutely need to query. Because Amazon’s data warehouse uses columnar storage, a construct called a distribution key is used instead: a column based on which data is distributed across the different nodes of the cluster.
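For illustration, a distribution key (and a sort key) is declared when the table is created; the table and column names below are hypothetical:

```sql
-- Hypothetical Redshift table: rows are distributed across nodes
-- by customer_id, and stored sorted by order_date so that
-- date-range scans touch fewer blocks.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (order_date);
```

Choosing a column that is frequently joined on (here, customer_id) keeps related rows on the same node and reduces data movement between nodes at query time.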

Manual house-keeping

Performance issues need to be handled through proper maintenance: VACUUM and ANALYZE, sort keys, compression, distribution styles, etc.

Tasks like VACUUM and ANALYZE need to be run regularly, and they are expensive and time-consuming. There is no single frequency that suits everyone, so a quick cost-benefit analysis is required before deciding how often to run them.
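For reference, the two maintenance commands look like this (the table name is a placeholder):

```sql
-- Reclaim space from deleted rows and re-sort the table.
-- VACUUM FULL does both; run it in a low-traffic window,
-- since it competes with regular queries for resources.
VACUUM FULL orders;

-- Refresh the table statistics the query planner relies on.
ANALYZE orders;
```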

Disk space capacity planning

Control over disk space is a must with Amazon Redshift, especially when you’re dealing with analytical workloads. There is a high chance of oversubscribing the system, and reduced disk space not only degrades query performance but also makes the system cost-prohibitive. Having a cluster filled above 75% isn’t good for performance.

Concurrent query limitation

Above 10 concurrent queries, you start seeing issues. Concurrency scaling may mitigate queue times during bursts of queries; however, simply enabling concurrency scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the constraints on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.

Conclusion

These were some of the fundamental problems, voiced by users, to keep in mind while using or exploring Amazon Redshift. If you are searching for more information about the challenges or benefits of AWS products, check out the next article in this series on AWS Redshift query limitations.

Comparing AWS Redshift?

See how the alternatives rank

Amazon Redshift Pricing: An Ultimate Guide

AWS’ data warehouse is a completely managed cloud service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, it is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Redshift vs Redshift Spectrum: A Complete Comparison

Amazon Redshift is a cloud-based data warehouse service offered by Amazon. Redshift is a columnar database which is optimized to handle the sort of queries now running in enterprise star schemas and snowflake schemas.

Redshift Spectrum is an extension of Amazon Redshift. As a feature of Redshift, Spectrum allows the user to query data stored on S3. With Amazon Redshift Spectrum, you can continue to store and grow your data in S3 and use Redshift as one of the compute options to process it (other options could be EMR, Athena, or Presto).

There are many differences between Amazon Redshift and Redshift Spectrum; here are some of them:

Architecture

[Image: Amazon Redshift architecture. Source: https://docs.aws.amazon.com/]

An Amazon Redshift cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. The client application interacts directly only with the leader node; the compute nodes are transparent to external applications.

[Image: Redshift Spectrum architecture. Source: aws.amazon.com]

Redshift Spectrum queries, by contrast, are submitted to the leader node of your Amazon Redshift cluster. The Amazon Redshift compute nodes generate multiple requests depending on the number of objects that need to be processed, and submit them concurrently to Redshift Spectrum. The Redshift Spectrum worker nodes scan, filter, and aggregate your data from Amazon S3 and send it back to your Amazon Redshift cluster. The final join and merge operations are then performed locally in your cluster, and the results are returned to your client.

Redshift Spectrum is a service that uses dedicated servers to handle the S3 portion of your queries. The AWS Glue Data Catalog is used to maintain the definitions of the external tables. Redshift loosely connects to S3 data by the following route:


External database, schema, and table definitions in Redshift use an IAM role to interact with the Glue Data Catalog and Spectrum, which handles the S3 portion of the queries.

Use case 

Amazon Redshift is a fully managed data warehouse that efficiently stores historical data from various sources. It is designed to ease the process of data warehousing and analytics.

Redshift Spectrum is used to perform analytics directly on data in Amazon S3 using Amazon Redshift nodes. This allows users to separate storage and compute, and to scale them independently.

You can use Redshift Spectrum, an add-on to Amazon Redshift, to query data in S3 files alongside existing information in the Redshift data warehouse. In addition to querying the data in S3, you can join the data from S3 to tables residing in Redshift.

Performance

Because Amazon Redshift controls how data is stored, compressed, and queried, it has many more options for optimizing a query. Redshift Spectrum, on the other hand, only has control over how the data is queried (how it is stored in S3 is up to you). Performance of Redshift Spectrum depends on your Redshift cluster resources and the optimization of your S3 storage.

That said, Spectrum offers the convenience of not having to import your data into Redshift. Basically you’re trading performance for the simplicity of Spectrum. Lots of companies use Spectrum as a way to query infrequently accessed data and then move the data of interest into Redshift for more regular access.

Conclusion

This article provides a quick recap of the major differences between Amazon Redshift and Redshift Spectrum. It takes into consideration today’s data platform needs. 

Simply, Amazon Redshift can be classified as a tool in the “Big Data as a Service” category, whereas  Amazon Redshift Spectrum is grouped under “Big Data Tools”. 

If you are an existing customer of Amazon Redshift looking for the best price-performance solution to run SQL on an AWS S3 data lake, try our community edition or 14-day free trial.



How to Use AWS Redshift Spectrum in AWS Lake Formation

As we’ve covered previously in What is Redshift Used For?, AWS Redshift is a cloud data warehouse used for online analytical processing (OLAP) and business intelligence (BI). Due to Redshift’s coupled architecture and relatively high costs at larger data volumes, businesses often seek to limit the workloads running on Redshift, while utilizing other analytic services including open-source Presto as part of a data lake house architecture.

Lake Formation makes it easier to set up the data lake, and to incorporate Redshift as part of the compute layer alongside other analytics tools and services. Developers can optimize their costs by using AWS Redshift for frequently accessed data and move less frequently accessed data to the Amazon S3 data lake, where it can be queried using serverless query engines such as Athena, Ahana, and Redshift Spectrum.

Two main reasons you would want to use Redshift with Lake Formation:

  • Granting and revoking permissions: Within Lake Formation, there is an independent permissions model in addition to the general IAM permissions set on an AWS account. This enables granular control over who can read data from a lake. You can grant and revoke permissions to the Data Catalog objects, such as databases, tables, columns, and underlying Amazon S3 storage. With Redshift following the Lake Formation permissions model out-of-the-box, you can ensure that the users querying data in Redshift are only accessing data they are meant to access. 
  • Creating external tables and running queries: Amazon Redshift Spectrum can be used as a serverless query option to join data stored in Redshift with data residing on S3. Lake Formation allows you to create virtual tables that correspond to S3 file locations and register them in the Data Catalog. A Redshift Spectrum query would then be able to consume this data without additional configuration.
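A minimal sketch of the second pattern; the IAM role ARN, schema, database, and table names below are placeholders:

```sql
-- Map a Glue / Lake Formation database into Redshift as an external schema.
-- 'lake_db', 'spectrum_schema', and the role ARN are hypothetical.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Join S3-resident data (via Spectrum) with a local Redshift table.
SELECT s.event_date, s.clicks, c.customer_name
FROM spectrum_schema.click_events s
JOIN local_schema.customers c ON s.customer_id = c.customer_id;
```

If the Lake Formation permissions model is enabled, the query only succeeds when the assumed role has been granted access to the catalog tables involved.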

How to Integrate AWS Redshift in Lake Formation

Lake Formation relies on the AWS Glue Crawler to store table locations in the Glue Data Catalog, which can then be used to control access to S3 data for other analytics services, including Redshift. This AWS blog post suggests a reference architecture for connecting the various services involved:

  • Data stored in an Amazon S3 lake is crawled using AWS Glue Crawler.
  • Glue Crawler then stores the data in tables and databases in the AWS Glue Data Catalog.
  • The S3 bucket is registered as the data lake location with Lake Formation. Lake Formation is natively integrated with the Glue Data Catalog.
  • Lake Formation grants permissions at the database, table, and column level to the defined AWS Identity and Access Management (IAM) roles.
  • Developers create external schemas within Amazon Redshift to manage access for other business teams.
  • Developers provide access to the user groups to their respective external schemas and associate the appropriate IAM roles to be assumed. 
  • Users now can assume their respective IAM roles and query data using the SQL query editor to their external schemas inside Amazon Redshift.
  • After the data is registered in the Data Catalog, each time users try to run queries, Lake Formation verifies access to the table for that specific principal. Lake Formation vends temporary credentials to Redshift Spectrum, and the query runs.

Using Lake Formation as Part of an Open Data Lakehouse 

One of the advantages of a data lake is its open nature, which allows businesses to use a variety of best-in-breed analytics tools for different workloads. This replaces database-centric architectures, which require storing data in proprietary formats and getting locked in with a particular vendor.

Implementing Lake Formation makes it easier to move more data into your lake, where you can store it in open-source file formats such as Apache Parquet and ORC. You can then use a variety of tools that interface with the Glue Data Catalog and read data directly from S3. This provides a high level of flexibility, prevents vendor lock-in, and strongly decouples storage from compute, reducing your overall infrastructure costs. (You can read more about this topic in our new white paper: The SQL Data Lakehouse and Foundations for the New Data Stack.)

If you’re looking for a truly open and flexible option for serverless querying, check out Ahana Cloud. Ahana Cloud and AWS Lake Formation make it easy to build and query secure S3 data lakes. Using the native integration, data platform teams can seamlessly connect Presto with AWS Glue, AWS Lake Formation, and AWS S3 while providing granular data security. Enabling the integration in Ahana Cloud is a single click when creating a new Presto cluster.

Learn more about Ahana Cloud’s integration with AWS Lake Formation.

Related Articles


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.


Data Warehouse: A Comprehensive Guide

Introduction

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. It is typically composed of operational data that has been aggregated and organized in a way that facilitates the requirements of data teams. Data consumers want to be able to do their work at very high speed to make decisions. By design, there is usually some latency before data appears in a warehouse; keep that in mind when designing your systems and gathering requirements from your users. In this article, we’re going to review the data warehouse architecture types and the different warehouse model types.

Data Warehouse Architecture Types

The various data warehouse architecture types break down into three categories:

Single-tier architecture – The objective of this architecture is to dramatically reduce data duplication and produce a dense set of data. While this design keeps the volume of data as low as possible, it is not appropriate for complex data requirements that include numerous data sources.

Two-tier architecture – This architecture design splits the physical data from the warehouse itself, making use of a system and a database server. This design is typically used for a data mart in a small organization, and while efficient at data storage, it is not a scalable design and can only support a relatively small number of users.

Three-tier architecture – The three-tier architecture is the most common type of data warehouse as it provides a well-organized flow of your raw information to provide insights. It is comprised of the following components:

  • Bottom tier – comprises the database of the warehouse servers. It creates an abstraction layer on the various information sources to be used in the warehouse. 
  • Middle tier – includes an OLAP server to provide an abstracted view of the database for the users. Being pre-built into the architecture, this tier can be used as an OLAP-centric warehouse.
  • Top tier – comprises the client-level tools and APIs that are used for data analysis and reporting. 

Data Warehouse Model Types 

The data warehouse model types break down into four categories:

  1. Enterprise Data Warehouse

An EDW is a centralized warehouse that collects all the information on subjects across the entire organization. It tends to be a collection of databases, as opposed to one monolith, that provides a unified approach to querying data by subject.

  2. Data Mart 

A data mart consists of a subset of a warehouse that is useful for a specific group of users. Consider a marketing data mart populated with data from ads, analytics, social media engagement, email campaigns, etc. This enables the marketing department to rapidly analyze its data without scanning through volumes of unrelated data. A data mart can be further classified as “independent”, where the data stands alone, or “dependent”, where the data comes from the warehouse.

  3. Operational Data Store

The ODS might seem slightly counterintuitive at first, as it is used for operational reporting, and typically we don’t want to run reporting and analytic workloads on operational data. It is a synergistic complement to the previously mentioned EDW, used for reporting on operational types of data. Low-velocity data that is managed in real time, such as customer records or employee records, is typical of this kind of store.

  4. Virtual Warehouse

The Virtual Warehouse is perhaps a questionable inclusion, but important nonetheless. It is implemented as a set of views over your operational database. These views tend to be limited in what they can expose, due to the relationships in the data, and the fact that you don’t want to destroy your operational database’s performance by running large numbers of analytic activities against it at the same time.

Summary

A warehouse provides an environment that fosters drill-down analysis on your data in search of insights. As a data analyst looks for trends or actionable insights, the ability to navigate easily through various data dimensions is paramount. The warehouse approach allows you to store and analyze vast amounts of information, but it also comes at a cost for storage and compute. You can mitigate some of these costs by optimizing your warehouse for data retrieval: pick a DW design and stick with it, and ensure that your data has been cleansed and standardized prior to loading.
An alternative to the warehouse is the growing data lake approach, where information is read in place from an object store such as AWS S3. Advantages include reduced cost and latency, since the load into the DW is no longer necessary. The Community Edition of the Presto managed service from Ahana is a great way to try out the data lake approach against your requirements.


Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Typically a data warehouse contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from diverse data sources. It requires an Extract, Transform, and Load (ETL) process that pulls from those diverse sources and creates another copy of the data within the warehouse to support SQL queries and analysis.

The following data design techniques are used to facilitate data retrieval for analytical processing:

Star Schema: the foundational and simplest schema in data warehouse modeling. It contains one or more fact tables referencing any number of dimension tables. Its graphical representation looks like a star, hence the name. Fact tables are usually very large compared to dimension tables, and dimension tables can contain redundant data, as they are not required to be normalized.

Snowflake Schema: an extension of the star schema in which a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables. The entity-relationship diagram of this schema resembles a snowflake, hence the name.

Data Warehouse Example

Consider a fact table that stores sales quantities for each product and customer at a certain time. Sales quantity is the measure here, and the (primary) keys from the customer, product, and time dimension tables flow into the fact table as foreign keys. Additionally, products can be further grouped into product families stored in a separate table, whose primary key goes into the product table as a foreign key. Such a construct is called a snowflake schema, as the product dimension is further snowflaked into the product family.


Figure 1 explains the typical snowflake schema design. 
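The worked example above can be sketched as generic DDL; all table and column names are illustrative:

```sql
-- Snowflaked dimension: product families sit in their own table.
CREATE TABLE product_family (
  family_id   INT PRIMARY KEY,
  family_name VARCHAR(100)
);

-- Product dimension references the product_family table.
CREATE TABLE product (
  product_id   INT PRIMARY KEY,
  family_id    INT REFERENCES product_family (family_id),
  product_name VARCHAR(100)
);

-- Fact table: one row per product/customer/time combination,
-- with sales quantity as the measure.
CREATE TABLE sales_fact (
  product_id  INT REFERENCES product (product_id),
  customer_id INT,
  time_id     INT,
  quantity    INT
);
```

In a pure star schema, family_name would simply be a (redundant) column on the product table; normalizing it out into product_family is what makes this a snowflake.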

ETL or ELT—Extracting, Transforming, and Loading Data

Besides the differences in data modeling and schemas, building a data warehouse involves the critical task of ETL: compiling data into the warehouse from other sources.

[Diagram: ETL vs. ELT]

In data extraction, we move data out of source systems: relational databases, NoSQL databases, or streaming data sources. The challenge during this step is to identify the right data and manage access control.

In a data pipeline or batch workloads, we frequently move a large amount of data from different source systems to the data warehouse. Here the challenges are to plan a realistic SLA and to have a reliable and fast network and infrastructure. 

In data transformation, we format data so that it can be represented consistently in the data warehouse. The original data might reside in different databases using different data types or in different table formats, or in different file formats in different file systems. 

In data loading, we load data into the fact tables correctly, with an error-handling procedure in place.

Data Warehouse To Data Lake To Data Lakehouse

A data lake is a centralized file system or storage designed to store, process, and secure large amounts of structured, semistructured, or unstructured data. It can store data in its native format and process any variety of it. Examples of a data lake include HDFS, AWS S3, ADLS or GCS.

Data lakes use the ELT (Extract, Load, Transform) process while data warehouses use the ETL (Extract, Transform, Load) process. With a SQL engine like Presto you can run interactive queries, reports, and dashboards directly on a data lake, without creating yet another data warehouse or copy of your data and adding operational overhead.
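As an illustration of that point, Presto can even join lake data with an operational database in a single federated query; the catalog, schema, and table names below are hypothetical:

```sql
-- Federated query: join an S3-backed Hive table with a MySQL table.
-- 'hive.lake.orders' and 'mysql.crm.customers' are illustrative names
-- for tables exposed through configured Presto catalogs.
SELECT o.order_id, o.amount, c.email
FROM hive.lake.orders o
JOIN mysql.crm.customers c ON o.customer_id = c.customer_id
WHERE o.order_date > DATE '2022-06-01';
```

No data is copied into a warehouse first; each connector reads from its own source and Presto performs the join.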

A data lake is just one element of an Open Data Lakehouse, which combines the benefits of a data warehouse and a data lake. However, an Open Data Lakehouse is much more than that; it is the entire stack. In addition to a data lake (AWS S3) and a SQL engine (Presto), it also provides governance (AWS Lake Formation) and ACID transactions. Transaction support is achieved using technologies and projects such as Apache Hudi, while Presto is the SQL engine that sits on top of the cloud data lake you're querying. In addition to this, there is Ahana Cloud. Ahana is a managed service for Presto, designed to simplify the process of configuring and operating Presto. 


As cloud data warehouses become more cost-prohibitive and limited by vendor lock-in, and as the data mesh, or data federation, approach proves not performant enough, more and more companies are migrating their workloads to an Open Data Lakehouse. If all your data is going to end up in cloud-native storage like Amazon S3, ADLS Gen2, or GCS, then the most optimized and efficient data strategy is to leverage an Open Data Lakehouse stack, which provides much more flexibility and remedies the challenges noted above. Taking on the task of creating an Open Data Lakehouse is difficult. As an introduction to the process, check out this on-demand presentation, How to build an Open Data Lakehouse stack. In it you'll see how to build your stack in more detail, while incorporating technologies like Ahana, Presto, Apache Hudi, and AWS Lake Formation.

Related Articles

5 Components of Data Warehouse Architecture

In this article we'll look at the five components that make up a data warehouse architecture.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. Learn more about it in this article.

Data Lakehouse

AWS Redshift Query Limits

What is AWS Redshift?

At its heart, AWS Redshift is Amazon's petabyte-scale data warehouse product, based on PostgreSQL version 8.0.2. It has since evolved into a powerful distributed system that can provide speedy results across millions of rows. Conceptually it is based on node clusters, with a leader node and compute nodes. The leader generates the execution plan for queries and distributes those tasks to the compute nodes. Scalability is achieved with elastic scaling that can quickly add or modify compute nodes as needed. We'll discuss the details in the article below.

Limitations of Using AWS Redshift

There are of course Redshift limitations on many parameters, which Amazon refers to as “quotas”. There is a Redshift query limit, a database limit, a Redshift query size limit, and many others. These have default values from Amazon and are per AWS region. Some of these quotas can be increased by submitting an Amazon Redshift Limit Increase Form. Below is a table of some of these quota limitations.

Quota | Value | Adjustable
Nodes per cluster | 128 | Yes
Nodes per region | 200 | Yes
Schemas per DB per cluster | 9,900 | No
Tables per node type | 9,900 – 100,000 | No
Query limit | 50 | No
Databases per cluster | 60 | No
Stored procedures per DB | 10,000 | No
Query size limit | 100,000 rows | Yes
Saved queries | 2,500 | Yes
Correlated subqueries | Need to be rewritten | No

AWS Redshift Performance

To start, Redshift stores data in a compressed, columnar format. This means there is less area on disk to scan and less data to move around. Add to that sort keys and zone maps, which play the role of indexes, and you have the base recipe for high performance. In addition, Redshift maintains a results cache, so frequently executed queries return quickly. This is aided by the query-plan optimization done in the leader node. Redshift also partitions data in a highly efficient manner to complement the optimizations in its columnar storage.

Scaling

Redshift offers a robust set of scaling strategies. With just a few clicks in the AWS Redshift console, or even a single API call, you can change node types, add nodes, and pause or resume the cluster. You can also use Elastic Resize to dynamically adjust your provisioned capacity within a few minutes, or the Resize Scheduler to schedule changes, say for month-end processing. Finally, Concurrency Scaling can automatically provision additional capacity for dynamic workloads.

Pricing

A lot of variables go into Redshift pricing depending on the scale and features you go with. All of the details and a pricing calculator can be found on the Amazon Redshift Pricing page. To give you a quick overview, however, prices start as low as $0.25 per hour. Pricing is based on compute time and size and goes up to $13.04 per hour. Amazon provides some incentives to get you started and try out the service.

First, similar to the Ahana Cloud Community Edition, Redshift has a "Free Tier": if your company has never created a Redshift cluster, you are eligible for a two-month trial of a DC2 large node. This provides 750 hours per month for free, which is enough to continuously run that DC2 node with 160GB of compressed SSD storage. Once your trial expires or your usage exceeds 750 hours per month, you can either keep it running at their "on-demand" pricing, or shut it down.

Next, there is a $500 credit available to use their Amazon Redshift Serverless option if you have never used it before. This applies to both the compute and storage and how long it will last depends entirely on the compute capacity you selected, and your usage.

Then there is "on-demand" pricing. This option lets you pay for provisioned capacity by the hour with no commitments or upfront costs; partial hours are billed in one-second increments. Amazon allows you to pause and resume these nodes when you aren't using them, so you don't keep paying while still preserving what you have; while paused, you only pay for backup storage.

Summary

Redshift provides a robust, scalable environment that is well suited to managing data in a data warehouse. Amazon provides a variety of ways to easily give Redshift a try without getting too tied in. Not all analytic workloads make sense in a data warehouse, however, and if you are already landing data into AWS S3, then you have the makings of a data lakehouse that can offer better price/performance. A managed Presto service, such as Ahana, can be the answer to that challenge.

Want to learn more about the value of the data lake?

In our free whitepaper, Unlocking the Business Value of the Data Lake, we’ll show you why companies are moving to an open data lake architecture and how they are getting the most out of that data lake to drive their business initiatives.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

What is AWS Redshift Spectrum?

Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Learn more about its performance and price.

Ahana Cloud for PrestoDB

What are the Benefits of a Managed Service?

Managed Services – Understanding the basics

What are the operational benefits of using a managed service for Presto with Ahana Cloud? To answer this question, first let’s hear from an AWS Solution Architect about his experience using Ahana as a solution for his data lakehouse: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So, the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service. The customer of Ahana does not have to do a lot of work. Everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service.”

The architect goes on to state, “the customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC.”

In addition to this the AWS architect continues to state, “this is beneficial. As the user, you have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have some context from a current user and a solution provided by this data architect, let's get more specific about the reasons a user would want to select a managed service as their SQL engine for data lakehouse analytics and reporting.

For example, let's say you want to create a new cluster. It's just a couple of clicks with Ahana Cloud, rather than an entire arduous process without the facilitation of a managed service. You can pick the coordinator instance type and the Hive metastore instance type, and it is all flexible.

To take the illustration further: instead of using the Ahana Cloud-provided Hive metastore, you can bring your own Amazon Glue catalog. This lets the user maintain control and streamline their tasks.

Then of course it's easy to add additional data sources. For that, you can add JDBC endpoints for your databases; Ahana has those integrated. After a connection is added, Ahana Cloud automatically restarts the cluster.

Compared to EMR or other distributions, this is far simpler for the user. Without a managed service, all of the following has to be completed manually:

  • You have to create a catalog properties file for each data source
  • Restart the cluster on your own
  • Scale the cluster manually
  • Add your own query logs and statistics
  • Rebuild everything when you stop and restart clusters
Managed service for Presto

With Ahana Cloud as a managed service for PrestoDB, all of this manual action and complexity is taken away, which in turn allows data analysts and users to focus on their work, rather than spending a large amount of time on labor-intensive processes and complicated configuration as a prerequisite to getting started with analytical tasks.

For scaling up, if you want to grow your analytics jobs over time, you can add nodes seamlessly. Ahana Cloud, as a managed service, and other distributions can add nodes to the cluster while your services are still up and running. The part that isn't as seamless or simple with other distributions is when you stop the entire cluster.

With Ahana Cloud, the configuration, the cluster's connections to data sources, and the Hive metastore are all maintained, in addition to all the workers and the coordinator being provisioned. When you as the user restart the cluster, everything comes up pre-integrated with the click of a button: the nodes are provisioned again, and you have access to that same cluster to continue your analytics work.

This is important because otherwise the operator has to manage everything on their own, including configuration management and reconfiguration of the catalog services. With EMR, for example, when you terminate a cluster, you lose track of that cluster altogether; you have to start from scratch and reintegrate the whole system.

Reduce Frustration When Configuring

See how Ahana simplifies your SQL Engine

Serverless SQL Engine

Next Steps – Exploring a Managed Service

As you and your team members look to reduce friction in your data analytics stack, learn how Ahana Cloud reduces frustration and time spent configuring for data teams. The number one reason for selecting a managed service is that it will make your life easier. Check out our customer stories to see how organizations like Blinkit, Carbon, and Adroitts were able to increase price-performance and bring control back to their data teams, all while simplifying their processes and bringing a sense of ease to their in-house data management teams.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.

Data Lakehouse

Price-Performance Ratio of AWS Athena vs Ahana Cloud for Presto

What does AWS Athena cost? Understand the price-performance ratio of Amazon Athena vs. Ahana. Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project which was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. There are a handful of popular services that use Presto, including both AWS Athena and Ahana Cloud. 

To explain AWS Athena pricing compared to Ahana pricing, let's first cover what is different between them. The biggest difference is that AWS Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers. The next biggest difference is the pricing model: with AWS Athena, instead of paying for the amount of compute used, you pay by the amount of data scanned, while Ahana Cloud is priced by the amount of compute used. This can make a huge difference in price/performance. Before we get into price-performance specifically, here's an overview of the comparison:

 | AWS Athena (serverless Presto) | Ahana Cloud for Presto (managed service)
Cost dimension | Pay for the amount of data scanned on a per-query basis, at USD $5 per terabyte scanned. It may be hard to estimate how much data your queries will scan. | Pay only for EC2 usage on a per-node/hour basis for EC2 and Ahana.
Cost effectiveness | Only pay while the query is scanning, not for idle times. | Only pay for EC2 and Ahana Cloud while compute resources are running, plus ~$4 per day for the managed service.
Scale | Can scale query workloads but has concurrency limits. | Can easily scale query workloads without concurrency limits.
Operational overhead | Lowest operational overhead: no need to patch the OS – AWS handles that. | Low operational overhead: no need to patch the OS – Ahana Cloud handles that and the operation of the servers.
Update frequency | Infrequent updates to the platform; not current with PrestoDB, over 60 releases behind. | Frequent updates to the platform; Presto on Ahana Cloud is typically upgraded quarterly to keep up with the most recent releases.

Both let you focus on deriving insight from your analytical queries, as you can leave the heavy lifting of managing the infrastructure to AWS and the Ahana Cloud managed service. 

How do you define price-performance ratio?

Price–performance ratio
From Wikipedia, the free encyclopedia
In engineering, the price–performance ratio refers to a product’s ability to deliver performance, of any sort, for its price. Generally, products with a lower price/performance ratio are more desirable, excluding other factors.

Comparing the Price-Performance Ratio of Amazon Athena vs. Ahana Cloud

For this comparison, we’ll look at performance in terms of the amount of wall-clock time it takes for a set of concurrent queries to finish. The price is the total cost of running those queries. 

Instead of using a synthetic benchmark, we'll look at the public case study of real-world workloads from Carbon, who used Athena and then switched to Ahana Cloud. While your workloads will be different, you'll see why the price-performance ratio is likely many times better with Ahana Cloud. And by going through an example, you'll be able to apply the same method when doing a quick trial (we're here to help too).

Here’s a few things that the Carbon public case study showed:

  • While you cannot tell how many, or what type of, EC2 instances Athena V2 uses, Carbon determined that they could get similar performance with 10 c5.xlarge workers on Ahana Cloud
  • Athena V2 would start to queue queries once 2 other queries were already running, extending the total wall-clock time

AWS Athena is constrained by AWS concurrency limits

AWS Athena cost

Ahana has higher concurrency so queries finish faster

Ahana cost
  • The queries would be charged at a rate of $5/TB scanned regardless of the amount of compute used. Their 7 tests ended up scanning X TBs = $Y 
  • Ahana Cloud with 10 X c5.xlarge workers has total costs of:
Type | Instance | Price/hr | Qty. | Cost/hr
Presto Worker | c5.xlarge | 17 cents | 10 | $1.70
Presto Coordinator | c5.xlarge | 17 cents | 1 | $0.17
Ahana Cloud | | 10 cents | 11 | $1.10
Ahana Cloud Managed Service | | 8 cents | 1 | $0.08
Total | | | | $3.45

So, you can run many queries for one hour that scan any amount of data for only $3.45 compared to one query of Athena scanning one TB of data costing $5.00.
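The trade-off between the two pricing models can be sketched in a few lines of Python. This is an illustration only, using the $3.45/hour cluster cost and the $5/TB Athena rate quoted above; the function names are our own, and actual bills depend entirely on your workload:

```python
ATHENA_RATE_PER_TB = 5.00       # $ per TB scanned, per query
AHANA_CLUSTER_PER_HOUR = 3.45   # total hourly cost of the 11-node cluster above

def athena_cost(tb_scanned):
    """Athena model: pay per TB scanned, regardless of compute time used."""
    return ATHENA_RATE_PER_TB * tb_scanned

def ahana_cost(hours):
    """Ahana Cloud model: pay per hour the cluster runs, regardless of data scanned."""
    return AHANA_CLUSTER_PER_HOUR * hours

# One Athena query scanning 1 TB already costs more than an hour of the
# cluster, which can serve many queries in that hour.
print(athena_cost(1.0), ahana_cost(1.0))  # 5.0 3.45
```

The crossover point depends on how much data your queries scan per hour of cluster time.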

Summary

While there is value in the simplicity of AWS Athena’s serverless approach, there are trade-offs around price-performance. Ahana Cloud can help.

Ahana is an easy cloud-native managed service with pay-as-you-go-pricing for your PrestoDB deployment. 

Ready to Compare Ahana to Athena?

Start a free trial today and experience better price performance

Data Lakehouse

What is AWS Redshift Spectrum?

What is Redshift Spectrum?

Launched in 2017, Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Spectrum allows you to do federated queries from within the Redshift SQL query editor to data in S3, while also being able to combine it with data in Redshift.

Since it shares a name with AWS Redshift, there is some confusion as to what AWS Redshift Spectrum is. To clear that up, it's important to know what AWS Redshift is: an Amazon data warehouse product based on PostgreSQL version 8.0.2.

Benefits of AWS Redshift Spectrum

When compared to a similar object-store SQL engine available from Amazon, such as Athena, Spectrum has significantly higher and more consistent performance. Athena uses pooled resources, while Spectrum's capacity is based on your Redshift cluster size and is, therefore, a known quantity.

Spectrum allows you to access your data lake files from within your Redshift data warehouse, without having to go through an ingestion process. This makes data management easier. This also reduces data latency since you aren’t waiting for ETL jobs to be written and processed.

With Spectrum, you continue to use SQL to connect to and read AWS S3 object stores in addition to Redshift. This means there are no new tools to learn and it allows you to leverage your existing skillsets to query Redshift. Under the hood, Spectrum is breaking the user queries into filtered subsets that run concurrently. These can be distributed across thousands of nodes to enhance the performance and can be scaled to query exabytes of data. The data is then sent back to your Redshift cluster for final processing.

AWS Redshift Spectrum Performance & Price

AWS Redshift Spectrum is going to be as fast as the slowest data store in your aggregated query. If you are joining from Redshift to a terabyte-sized CSV file, the performance will be extremely slow. Connecting to a well-partitioned collection of column-based Parquet stores on the other hand will be much faster. Not having indexes on the object stores means that you really have to rely on the efficient organization of the files to get higher performance.

As to price, Spectrum follows the terabyte scan model that Amazon uses for a number of its products. You are billed per terabyte of data scanned, rounded up to the next megabyte, with a 10 MB minimum per query. For example, if you scan 10 GB of data, you will be charged $0.05. If you scan 1 TB of data, you will be charged $5.00. This does not include any fees for the Redshift cluster or the S3 storage.
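To make the scan-based billing rule concrete, here is a small Python sketch of the pricing just described ($5 per TB, rounded up to the next megabyte, with a 10 MB minimum per query). The function is our own illustration, not an AWS API:

```python
import math

PRICE_PER_TB = 5.00      # USD per terabyte scanned
MIN_SCAN_MB = 10         # 10 MB minimum charge per query
MB_PER_TB = 1024 * 1024  # megabytes in a terabyte

def spectrum_query_cost(scanned_mb):
    """Estimated Spectrum charge for one query: megabytes scanned are
    rounded up, subject to a 10 MB minimum, then billed at $5/TB."""
    billable_mb = max(math.ceil(scanned_mb), MIN_SCAN_MB)
    return billable_mb / MB_PER_TB * PRICE_PER_TB

# 10 GB scanned -> about $0.05; 1 TB scanned -> $5.00
print(round(spectrum_query_cost(10 * 1024), 2))    # 0.05
print(round(spectrum_query_cost(1024 * 1024), 2))  # 5.0
```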

Redshift and Redshift Spectrum Use Case

An example of combining Redshift and Redshift Spectrum could be a high-velocity eCommerce site that sells apparel. Your historical order history is contained in your Redshift data warehouse. However, real-time orders are coming in through a Kafka stream and landing in S3 in Parquet format. Your organization needs to make an order decision for particular items because there is a long lead time. Redshift knows what you have done historically, but that S3 data is only processed monthly into Redshift. With Spectrum, the query can combine what is in Redshift and join that with the Parquet files on S3 to get an up-to-the-minute view of order volume so a more informed decision can be made.

Summary

Amazon Redshift Spectrum adds a layer of functionality to Redshift that lets you interact with object stores in AWS S3 without building a whole other tech stack. It makes sense for companies that are using Redshift and need to stay there, but also need to make use of the data lake, or for companies that are considering leaving Redshift behind and moving entirely to the data lake. Redshift Spectrum does not make sense if all your files are already in the data lake: Spectrum becomes very expensive as the data grows, with little visibility into queries. That is where a managed service like Ahana for Presto fits in.

Run SQL on your Data Lakehouse

At Ahana, we have made it very simple and user-friendly to run SQL workloads on Presto in the cloud. You can get started with Ahana Cloud today and start running SQL queries in a few minutes.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.

Lake-Formation-architecture

Building a Data Lake: How to with Lake Formation on AWS

What is AWS Lake Formation?

Briefly, AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time: days instead of months, as is common with a traditional data lake approach. AWS Lake Formation builds on and works with the capabilities found in AWS Glue.

How it Works

Your root user can’t be your administrator for your data lake, so the first thing you want to do is create a new user that has full admin rights. Go to IAM and create that user and give them AdministratorAccess capability. Next, to get started with building a data lake, create an S3 bucket and any data directories you are going to use if you don’t already have something configured. Do that in the S3 segment of AWS as you would normally. If you already have an S3 location setup, you can skip that step. In either case, we then need to register that data lake location in Lake Formation. The Lake Formation menu looks like this:

Data Lake Formation

Now with your Lake Formation registered data sources, you can create a database from those sources in Lake Formation, and from there, create your Glue Crawlers as the next step of building a data lake. The crawler will take that database that you created, and go into the S3 bucket, read the directory structure and files to create your tables and fields within the database. Once you’ve run your Crawler, you’ll see the tables and fields reflected under “Tables”. The crawler creates a meta-data catalog that provides the descriptions of the underlying data that is then presented to other tools to access, such as AWS Quicksight and Ahana Presto. Amazon provides this diagram:

AWS Lake Formation

To summarize thus far, we’ve 

  • Created an admin user
  • Created an S3 bucket
  • Created three directories in the S3 bucket
  • Registered the S3 bucket as a data lake location

Benefits of Building a Data Lake with AWS Lake Formation

Having your data repositories registered and then created as a database in Lake Formation provides a number of advantages in terms of centralization of work. Fundamentally, the role of Lake Formation is to control access to data that you register. A combination of IAM roles and “Data lake permissions” is how you control this on a more macro level. Amazon shows the flow this way:

what is a data lake

Where the major advantages lie however, are with the “LF-Tags” and “LF-tag permissions”. This is where your granular security can be applied in a way that will greatly simplify your life. Leveraging Lake Formation we have two ways to assign and manage permissions to our catalog resources. There is “Named” based access and “Tag” based access.

data lake permissions

Named-based access is what most people are familiar with. You select the principal, which can be an AWS user or group of users, and grant it access to a specific database or table. The tag-based access control method uses Lake Formation tags, called "LF-Tags". These are attributes assigned both to data catalog resources (databases, tables, and even columns) and to principals in your AWS account, and access is authorized when the principal's tags match the resource's. This is especially helpful in environments that are growing or changing rapidly, where policy management can be onerous. Tags are essentially key/value pairs that define these permissions:

  • Tags can be up to 128 characters long
  • Values can be up to 256 characters long
  • Up to 15 values per tag
  • Up to 50 LF-Tags per resource

AWS Lake Formation Use Cases

If we wanted to control access to an employee table for example, such that HR could see everything, everyone in the company could see the names, titles, and departments of employees, and the outside world could only see job titles, we could set that up as:

  • Key = Employees
  • Values = HR, corp, public

Using this simplified view as an example:

Building a Data Lake

We have resources “employees” and “sales”, each with multiple tables, with multiple named rows. In a conventional security model, you would give the HR group full access to the employees resource, but all of the corp group would only have access to the “details” table. What if you needed to give access to position.title and payroll.date to the corp group? We would simply add the corp group LF Tag to those fields in addition to the details table, and now they can read those specific fields out of the other two tables, in addition to everything they can read in the details table. The corp group LF Tag permissions would look like this:

  • employees.details
  • employees.position.title
  • employees.payroll.date

If we were to control by named resources instead, each principal would have to be specifically granted access to those databases and tables, and often there is no ability to control access by column, so that part wouldn't even be possible at the data level.
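The difference between the two models can be illustrated with a toy Python sketch of the permission logic described above. The resource paths and tag values mirror the employee example; none of this is Lake Formation API code:

```python
# Toy model: each catalog resource carries a set of LF-Tag values
# (Key = Employees; Values = HR, corp, public), and a principal may
# read a resource when it holds a matching tag value.
resource_tags = {
    "employees.details":        {"HR", "corp"},
    "employees.position":       {"HR"},
    "employees.position.title": {"HR", "corp", "public"},
    "employees.payroll":        {"HR"},
    "employees.payroll.date":   {"HR", "corp"},
}

def can_read(principal_tag, resource):
    """True when the principal's tag value is attached to the resource."""
    return principal_tag in resource_tags.get(resource, set())

# corp reads the details table plus the two column-level grants...
assert can_read("corp", "employees.details")
assert can_read("corp", "employees.payroll.date")
# ...but not the rest of the payroll table, and public only sees titles.
assert not can_read("corp", "employees.payroll")
assert can_read("public", "employees.position.title")
```

Column-level grants like employees.payroll.date are exactly what a purely named-resource model struggles to express.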

Building a Data Lake: Summary

AWS Lake Formation really simplifies the process of building a data lake, whereby you set up and manage your data lake infrastructure. Where it really shines is in the granular security that can be applied through the use of LF Tags. An AWS Lake Formation tutorial that really gets into the nitty-gritty can be found online from AWS or any number of third parties on YouTube. The open-source data lake has many advantages over a data warehouse and Lake Formation can help establish best practices and simplify getting started dramatically.

What is an Open Data Lake in the Cloud?

Data-driven insights can help business and product leaders hone in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Learn how you can start building an Open Data Lake analytics stack using Presto, Hudi and AWS S3 and solve the challenges of a data warehouse

Data Lakehouse

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on demand, and it is compatible with a multitude of AWS tools and technologies. AWS Redshift is a preferred cloud data warehouse for many customers, but the pricing is not simple, since it tries to accommodate different use cases and customers. Let us try to understand the pricing details of Amazon Redshift.

Understanding Redshift cluster pricing

The Redshift cluster consists of multiple nodes allowing it to process data faster. This means Redshift performance depends on the node types and number of nodes. The node types can be dense compute nodes or Redshift managed storage nodes.

Dense Compute nodes: These nodes offer physical memory up to 244GB and storage capacity on SSD up to 2.5TB.

RA3 with managed storage nodes: These nodes have physical memory up to 384GB and storage capacity on SSD up to 128TB. Additionally, when storage runs out on the nodes, Redshift will offload data into S3. The RA3 pricing below does not include the cost of managed storage.

Redshift pricing

Redshift spectrum pricing

Redshift Spectrum is a serverless offering that allows running SQL queries directly against an AWS S3 data lake. Spectrum is priced per terabyte of data scanned.

redshift spectrum pricing

Concurrency scaling

Amazon Redshift allows you to grab additional resources as needed and release them when they are no longer needed. Up to one hour of concurrency scaling is free for every day of typical usage; every second beyond that is charged for the additional resource usage.

Here is a pricing example as stated by Amazon: a 10-node DC2.8XL Redshift cluster in US-East costs $48 per hour. Consider a scenario where two transient clusters are utilized for 5 minutes beyond the free concurrency scaling credits. The per-second on-demand rate is $48 × 1/3600 = $0.013 per second. The additional cost for concurrency scaling in this case is $0.013 per second × 300 seconds × 2 transient clusters = $8.
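Amazon's concurrency scaling arithmetic can be checked with a few lines of Python (same figures as the example above):

```python
cluster_rate_per_hour = 48.00                   # 10 DC2.8XL nodes, US-East
per_second_rate = cluster_rate_per_hour / 3600  # on-demand rate per second

extra_seconds = 5 * 60   # 5 minutes beyond the free scaling credits
transient_clusters = 2

extra_cost = per_second_rate * extra_seconds * transient_clusters
print(f"${extra_cost:.2f}")  # $8.00
```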

Amazon Redshift Managed Storage (RMS) pricing

Managed storage comes with RA3 node types. Usage of managed storage is calculated hourly based on the total GB stored. Managed storage does not include backup storage charges due to automated and manual snapshots.

Amazon Redshift cost

Pricing example for managed storage pricing

100 GB stored for 15 days: 100 GB × 15 days × (24 hours/day) = 36,000 GB-hours

100 TB stored for 15 days: 100 TB × 1024 GB/TB × 15 days × (24 hours/day) = 36,864,000 GB-hours

Total usage in GB-hours: 36,000 GB-hours + 36,864,000 GB-hours = 36,900,000 GB-hours

Total usage in GB-months: 36,900,000 GB-hours / 720 hours per month = 51,250 GB-months

Total charges for the month: 51,250 GB-months × $0.024 = $1,230
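The managed storage example works out as follows in Python, at $0.024 per GB-month (same numbers as above):

```python
HOURS_PER_DAY = 24
HOURS_PER_MONTH = 720         # billing month used in the example
PRICE_PER_GB_MONTH = 0.024

gb_hours_small = 100 * 15 * HOURS_PER_DAY         # 100 GB for 15 days
gb_hours_large = 100 * 1024 * 15 * HOURS_PER_DAY  # 100 TB for 15 days

total_gb_hours = gb_hours_small + gb_hours_large  # 36,900,000 GB-hours
gb_months = total_gb_hours / HOURS_PER_MONTH      # 51,250 GB-months
monthly_charge = gb_months * PRICE_PER_GB_MONTH   # about $1,230

print(round(monthly_charge, 2))
```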

Limitation of Redshift Pricing

As you can see, Redshift offers only a few instance types, with limited storage.

Customers could easily hit the ceiling in terms of the node storage, and Redshift managed storage is expensive for data growth.

Redshift Spectrum (the serverless option), at $5 per TB scanned, can become expensive, and it removes the customer's ability to scale nodes up and down to meet their performance requirements.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced use cases in Redshift, or managing increasing cloud storage costs, check out Ahana. Ahana is a powerful managed service for Presto which provides SQL on S3. Unlike Redshift Spectrum, Ahana allows customers to choose the right instance type and scale up/down as needed, and comes with a simple pricing model based on the number of compute instances.

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.

Ahana PAYGO Pricing

Ahana Cloud is easy to use, fully-integrated, and cloud native. Only pay for what you use with a pay-as-you-go model and no upfront costs.

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.

Presto vs Snowflake: Data Warehousing Comparisons

Data Lakehouse

Snowflake vs Presto

This article touches on several basic elements to compare Presto and Snowflake.

To start, let’s define what each of these is. Presto is an open-source SQL query engine for data lakehouse analytics, well known for ad hoc analytics on your data. One important thing to note is that Presto is not a database: you can’t store data in Presto, but you can use it as a compute engine for your data lakehouse. You can use Presto not just on the public cloud but also on private cloud infrastructure (on-premises or hosted).

Snowflake is a cloud data warehouse that offers a cloud-based data storage and analytics service. Snowflake runs completely on cloud infrastructure. Snowflake uses virtual compute instances for its compute needs and storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).

Use cases: Snowflake vs. Presto

Snowflake is a cloud solution for your traditional data warehouse workloads such as reporting and dashboards. It is good for small-scale workloads, i.e., moving traditional batch-based reporting and dashboard-based analytics to the cloud. I discuss this limitation in the Scalability and Concurrency section below.

Presto is not only a solution for reporting and dashboarding. With its connectors and in-place execution, platform teams can quickly provide access to datasets that analysts are interested in, and Presto can run queries in seconds. You can aggregate terabytes of data across multiple data sources and run efficient ETL queries. With Presto, users can query data across many different data sources, including databases, data lakes, and data lakehouses.

Open Source Or Vendor lock-in

Snowflake is not Open Source Software. Data that has been aggregated and moved into Snowflake is in a proprietary format only available to Snowflake users. Surrendering all your data to the Snowflake data cloud model is the ideal recipe for vendor lock-in. 

Vendor Lock-In can lead to:

  • Excessive cost as you grow your data warehouse
  • Data locked into the formats of a closed-source system once ingested
  • No community innovations or ways to leverage other innovative technologies and services to process that same data

Presto is an Open Source project, under the Apache 2.0 license, hosted by the Linux Foundation. Presto benefits from community innovation. An open-source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Bytedance, Ahana, and many more. Dedicated Ahana engineers are working on the new PrestoDB C++ execution engine aiming to bring high-performance data analytics to the Presto ecosystem. 

Open File Formats

Snowflake has chosen to use a micro-partition file format that is good for performance but closed source. The Snowflake engine cannot work directly with common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into an internal Snowflake file format, but users miss out on the performance optimizations these open formats can bring to an engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge ones.

On the other hand, Presto users can run ad-hoc, real-time analytics, and even deep learning, on those same source files previously mentioned, without needing to copy files, so users get more flexibility with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.

Open transaction format

Many organizations are adopting a Data Lakehouse architecture and augmenting their current data warehouse. This brings the need for a transaction manager layer, which can be provided by Apache Hudi, Apache Iceberg, or Delta Lake. Snowflake does not support all of these table formats, whereas Presto supports them all natively, allowing users more flexibility and choice. With ACID transaction support from these table formats, Presto is the SQL engine for the Open Data Lakehouse. Moreover, the Snowflake data warehouse doesn’t support semi-structured/unstructured data or AI/ML/data science workloads, whereas the data lakehouse does.

Data Ownership

While Snowflake did decouple storage and compute, they did not decouple data ownership. They still own the compute layer as well as the storage layer. This means users must ingest data into Snowflake using a proprietary format, creating yet another copy of the data and requiring users to move it out of their own environment. Users lose ownership of their data.

On the other hand, Presto is a truly disaggregated stack that allows you to run your queries in a federated manner without any need to move your data or create multiple copies. With Ahana, users can define Presto clusters and orchestrate and manage them in their own AWS account using cross-account roles.

Scalability and Concurrency

With Snowflake, you hit a limit on the maximum number of concurrent users per virtual warehouse. If you have more than eight concurrent users, you need to initiate another virtual warehouse. Query performance is good for simple queries; however, performance degrades as you apply more complex joins on large datasets, and the only options available are limiting the data you query with Snowflake or adding more compute. Parallel writes also impact read operations, and the recommendation is to have separate virtual warehouses.

Presto is designed from the ground up for fast analytic queries against data sets of any size. It has been proven on petabytes of data and supports tens of concurrent queries at a time.

Cost of Snowflake

Users think of Snowflake as an easy and low-cost model. However, it gets very expensive and cost-prohibitive to ingest data into Snowflake. Very large amounts of data and enterprise-grade, long-running queries can result in significant costs associated with Snowflake as it requires the addition of more virtual data warehouses which can rapidly escalate costs. Basic performance improvement features like Materialized Views come with additional costs. As Snowflake is not fully decoupled, data is copied and stored into Snowflake’s managed cloud storage layer within Snowflake’s account. Hence, the users end up paying a higher cost to Snowflake than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price with a proprietary tag.

Open Source Presto is completely free. Users can run on-prem or in a cloud environment. Presto allows you to leave your data in the lowest cost storage options. You can create a portable query abstraction layer to future-proof your data architecture. Costs are for infrastructure, with no hidden cost for premium features. Data federation with Presto allows users to shrink the size of their data warehouse. By accessing the data where it is, users may cut the expenses of ETL development and maintenance associated with data transfer into a data warehouse. With Presto, you can also leverage storage savings by storing “cold” data in low-cost options like a data lake and “hot” data in a typical relational or non-relational database. 

Snowflake vs. Presto: In Summary

Snowflake is a well-known cloud data warehouse, but sometimes users need more than that – 

  1. Immediate data access, in a federated manner, as soon as data is written
  2. Elimination of the lag associated with ETL migration, since you can query directly from the source
  3. A flexible environment to run unstructured/semi-structured or machine learning workloads
  4. Support for open file formats and storage standards to build an open data lakehouse
  5. Open-source technologies to avoid vendor lock-in
  6. A cost-effective solution that is optimized for high concurrency and scalability

Presto can solve all these user needs in a more flexible, open-source, secure, scalable, and cost-effective way.

SaaS for Presto

If you want to use Presto, we’ve made it easy to get started in AWS. Ahana is a SaaS for Presto. With Ahana for Presto, you can run in containers on Amazon EKS, making the service highly scalable and available. We have optimized Presto clusters that scale compute up and down as necessary, which helps companies achieve cost control. With Ahana Cloud, you can easily integrate Presto with Apache Ranger or AWS Lake Formation to address your fine-grained access control needs. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and the data access and security policies you want to apply.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

AWS Athena vs AWS Glue: What Are The Differences?

Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS, and a popular choice for business intelligence and reporting use cases (see What Is Redshift Used For?).

You might already be familiar with Redshift basics. However, in this article, we’ll dive a bit deeper to cover Redshift’s internal system design and how it fits into broader data lake and data warehouse architectures. Understanding these factors will help you reap the most benefits from your deployment while controlling your costs.

Redshift Data Warehouse Architecture and Main Components


As with other relational databases, storage and compute in Redshift are coupled. Data from applications, files, and cloud storage can be loaded into the data warehouse using native AWS services such as Amazon AppFlow, or through a variety of third-party apps such as Fivetran and Matillion. Many of these tools also provide ELT capabilities to further cleanse, transform, and aggregate data after it has been loaded into Redshift.
Zooming in on the internal architecture, we can see that a Redshift cluster is composed of a leader node, and compute nodes that are divided into node slices, and databases. This design allows Redshift to dynamically allocate resources in order to efficiently answer queries.

Breaking Down the Redshift Cluster Components

  • The leader node is Redshift’s ‘brain’ and manages communications with external client programs. It also manages the internal communication between compute nodes. When a query is made, the leader node will parse it, compile the code and create an execution plan.
  • Compute nodes provide the ‘muscle’ – the physical resources required to perform the requested database operation. This is also where the data is actually stored. Each compute node has dedicated CPU, RAM and storage, and these differ according to the node type.
  • The execution plan distributes the workload between compute nodes, which process the data in parallel. The workload is further distributed within the node: each node is partitioned into node slices, and each node slice is allocated a portion of the compute node’s memory and disk, according to the amount of data it needs to crunch.
  • Intermediate results are sent back to the leader node. This then performs the final aggregation and sends the results to client applications via ODBC or JDBC. These would frequently be reporting and visualization tools such as Tableau or Amazon Quicksight, or internal software applications that read data from Redshift.
  • Redshift’s Internal Network provides high-speed communication between the nodes within the cluster.
  • Each Redshift cluster can contain multiple databases, with resources dynamically allocated between them.

This AWS presentation offers more details about Redshift’s internal architecture. It also provides a step-by-step breakdown of how queries are handled in Redshift and Redshift Spectrum:

Additional Performance Features

In addition to these core components, Redshift has multiple built-in features meant to improve performance:

  • Columnar storage: Redshift stores data in a column-oriented format rather than the row-based storage of traditional OLTP databases. This allows for more efficient compression and indexing.
  • Concurrency scaling: When a cluster receives a large number of requests, Redshift can automatically add resources to maintain consistent performance in read and write operations. 
  • Massively Parallel Processing (MPP): As described above, multiple compute nodes work on portions of the same query at the same time. This ensures final aggregations are returned faster.
  • Query optimizer: Redshift applies query optimizations that leverage its MPP capabilities and columnar data storage. This helps Redshift process complex SQL queries that could include multi-table joins and subqueries. 
  • Result caching: The results of certain types of queries can be stored in-memory on the leader node, which can also reduce query execution time.
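The columnar-storage item above can be illustrated with a toy sketch: values within a column tend to be homogeneous, so a simple run-length encoding collapses repeats that row-oriented storage would interleave with other fields. This is a conceptual illustration, not Redshift's actual encoding:

```python
# Toy illustration of why column-oriented storage compresses well.
# Redshift's real encodings are more sophisticated; this is only a sketch.

def run_length_encode(values):
    """Collapse consecutive duplicates into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

rows = [("US", "mobile"), ("US", "mobile"), ("US", "web"), ("DE", "web")]
country_col = [r[0] for r in rows]        # column-wise layout of one field
print(run_length_encode(country_col))     # [('US', 3), ('DE', 1)]
```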

Redshift vs Traditional Data Warehouses

While Redshift can replace many of the functions filled by ‘traditional’ data warehouses such as Oracle and Teradata, there are a few key differences to keep in mind:

  • Managed infrastructure: Redshift infrastructure is fully managed by AWS rather than its end users – including hardware provisioning, software patching, setup, configuration, monitoring nodes and drives, and backups.
  • Optimized for analytics: While Redshift is a relational database management system (RDBMS) based on PostgreSQL and supports standard SQL, it is optimized for analytics and reporting rather than transactional features that require very fast retrieval or updates of specific records.
  • Serverless capabilities: Redshift Serverless can be used to automatically provision compute resources after a specific SQL query is made, further abstracting infrastructure management by removing the need to size your cluster in advance.

Redshift Costs and Performance

Amazon Redshift pricing can get complicated and depends on many factors, so a full breakdown is beyond the scope of this article. There are three basic types of pricing models for Redshift usage:

  • On-demand instances are charged by the hour, with no long-term commitment or upfront fees. 
  • Reserved instances offer a discount for customers who are willing to commit to using Redshift for a longer period of time. 
  • Serverless instances are charged based on usage, so customers only pay for the capacity they consume.

The size of your dataset and the level of performance you need from Redshift will often dictate your costs. Unlike object stores such as Amazon S3, scaling storage is non-trivial from a cost perspective (due to Redshift’s coupled architecture). When implementing use cases that require granular historical datasets you might find yourself paying for very large clusters. 

Performance depends on the number of nodes in the cluster and the type of node – you can pay for more resources to guarantee better performance. Other pertinent factors are the distribution of data, the sort order of data, and the structure of the query. 

Finally, you should bear in mind that Redshift compiles code the first time a query is run, meaning queries might run faster from the second time onwards – making it more cost-effective for situations where the queries are more predictable (such as a BI dashboard that updates every day) rather than exploratory ad-hoc analysis.

Reducing Redshift Costs with a Lakehouse Architecture

We’ve worked with many companies who started out using Redshift. This was sufficient for their needs when they didn’t have much data, but they found it difficult and costly to scale as their needs evolved. 

Companies can face rapid growth in data when they acquire more users, introduce new business systems, or simply want to perform deeper exploratory analysis that requires more granular datasets and longer data retention periods. With Redshift’s coupling of storage and compute, this can cause their costs to scale almost linearly with the size of their data.

At this stage, it makes sense to consider moving from a data warehouse architecture to a data lakehouse. This would leverage inexpensive storage on Amazon S3 while distributing ETL and SQL query workloads between multiple services.

Redshift Lakehouse Architecture Explained

In this architecture, companies can continue to use Redshift for workloads that require consistent performance such as dashboard reporting, while leveraging best-in-class frameworks such as open-source Presto to run queries directly against Amazon S3. This allows organizations to analyze much more data. It allows them to do so without having to constantly up or downsize their Redshift clusters, manage complex retention policies, or deal with unmanageable costs.
To learn more about what considerations you should be thinking about as you look at data warehouses or data lakes, check out this white paper by Ventana Research: Unlocking the Value of the Data Lake.

Still Choosing Between a Data Warehouse and a Data Lake?

Watch the webinar, Data Warehouse or Data Lake, which one do I choose? In this webinar, we’ll discuss the data landscape and why many companies are moving to an open data lakehouse.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand, but the pricing is not simple. This article breaks down how their pricing works.

AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.

Data Lakehouse

What Is Trino?

What is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine that was forked from the original Presto project, whose GitHub repo was called prestodb. As such, it was designed from the ground up for fast queries against any amount of data. It supports many types of data sources, including relational and non-relational sources, via its connector architecture.

What is the history of Trino?

Trino is a hard fork of the widely popular open source Presto project, which started out at Facebook, running large-scale interactive analytic queries against a 300PB data lake using Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive. In November 2013, Facebook open sourced Presto under the Apache 2.0 license and made it available in the public GitHub code repository named “prestodb”. In early 2019, the hard fork named Trino was started by the creators of Presto, who later became cofounders/CTOs of the commercial vendor Starburst. In the meantime, Presto became part of the openly governed Presto Foundation, hosted under the guidance and experience of The Linux Foundation. Trino has subsequently diverged from Presto, and many of the innovations the community is driving in Presto are not available outside of Presto. Today, only Presto is running at companies like Facebook, Uber, Twitter, and Alibaba.

Why is Trino so fast?

As it is a hard fork of the original Presto project, it carries with it some of the original elements which make Presto so fast, namely the in-memory execution architecture. Prior to Presto, distributed query engines such as Hive were designed to store intermediate results to disk.

How does Trino work?

It’s a distributed system that queries Hadoop/HDFS and many other data sources, and it uses an architecture similar to massively parallel processing (MPP) databases. It has one coordinator node working with multiple worker nodes. Users submit SQL to the coordinator, which uses its query and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It supports standard ANSI SQL, including complex queries, aggregations, joins, and outer joins.
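The coordinator/worker split described above can be sketched conceptually in Python (this is not Trino's actual code): the "coordinator" partitions the data, "workers" compute partial aggregates in parallel and in memory, and the coordinator merges the partials into a final result:

```python
# Conceptual sketch of MPP-style distributed aggregation, as used by
# engines like Trino/Presto. Illustrative only; not real engine code.
from concurrent.futures import ThreadPoolExecutor

def worker_partial_sum(partition):
    # Each worker aggregates only its slice of the data, in memory.
    return sum(partition)

def coordinator_sum(data, workers=4):
    # Coordinator splits the data into per-worker partitions...
    partitions = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_partial_sum, partitions))
    # ...then performs the final aggregation over the partial results.
    return sum(partials)

print(coordinator_sum(range(1, 101)))  # -> 5050
```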

What is Apache Trino?

Actually, this is a misnomer: Trino is not a project hosted under the well-known Apache Software Foundation (ASF). Apache Incubator and top-level projects are subject to the naming convention “Apache [Project Name]”, an example of which is Apache Mesos. Instead, the Trino project, a hard fork of Presto, sits with a vendor-controlled non-profit called the Trino Software Foundation, which is not affiliated with well-known project hosting organizations like the ASF or The Linux Foundation. The misnomer may have arisen from the fact that most open source projects, including Trino, use the Apache 2.0 license.

Is Trino OLAP?

It’s an open source distributed SQL query engine and a hard fork of the original Presto project created by Facebook. It lets developers run interactive analytics against large volumes of data. With Trino, organizations can easily use their existing SQL skills to query data without having to learn new complex languages. The architecture is quite similar to traditional online analytical processing (OLAP) systems that use distributed computing architectures, in which one coordinator node coordinates multiple worker nodes.

What is the Trino Software Foundation?

The Trino Software Foundation is a non-profit corporation controlled by the cofounders of the commercial vendor Starburst. It hosts the open source Trino project, a hard fork of the Presto project, which is separately hosted by the Linux Foundation. The Trino website says only two sentences about the foundation: “The Trino Software Foundation (formerly Presto Software Foundation) is an independent, non-profit organization with the mission of supporting a community of passionate users and developers devoted to the advancement of the Trino distributed SQL query engine for big data. It is dedicated to preserving the vision of high quality, performant, and dependable software.” What is not mentioned is any form of charter or governance. These are table stakes for Linux Foundation projects, where project governance is central to the project.

What SQL does Trino use?

Just like the original Presto, Trino is built with a familiar SQL query interface that allows interactive SQL on many data sources. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations.

What Is A Trino database?

The distributed system queries Hadoop/HDFS and other data sources using a classic MPP (massively parallel processing) model. The Java-based system has a coordinator node (master) working in conjunction with a scalable set of worker nodes. Users send their SQL query through a client to the Trino coordinator, which plans and schedules a distributed query plan across all its worker nodes. Both Trino and Presto are SQL query engines, not databases: they do not store any data. From a user perspective, however, Trino can appear as a database because it queries the connected data stores.

What is the difference between Presto and Trino?

There are technical innovations and differences between Presto and Trino that include:
– Presto is developed, tested, and runs at scale at Facebook, Uber, and Twitter
– Presto uses 6X less memory and repartitions 2X faster with project Aria
– “Presto on Spark” today can run massive batch ETL jobs.
– Presto today is 10X faster with project RaptorX, providing caching at multiple levels
– The Presto community is making Presto more reliable and scalable with multiple coordinators instead of the single point of failure of one coordinator node.  

Trino can query data where it is stored, without needing to move data into a separate warehouse or analytics database. Queries are executed in parallel using the memory of distributed worker machines, and most results return in seconds. Whereas Trino is a newer fork, Presto continues to be used by many well-known companies: Facebook, Uber, Twitter, and AWS. Trino is a vendor-driven project, as it is hosted by a non-profit organization owned by the cofounders of the Trino vendor Starburst. In comparison, Presto is hosted by the Presto Foundation, a sub-foundation under The Linux Foundation. There are multiple vendors who support Presto, including the Presto-as-a-Service (SaaS) offerings Ahana Cloud for Presto and AWS Athena, which is based on Presto, not Trino.

As the diagram below illustrates, Presto saves time by running queries in the memory of the worker machines: operations on intermediate datasets run in-memory, which is much faster than persisting them to disk. Presto also shuffles data among the workers as needed, which obviates the writes to disk between stages. Hive, by contrast, persists intermediate data sets to disk; Presto executes tasks in-memory.

While Presto and Trino share the pipelining approach, Presto has a number of performance innovations that are not shared, such as caching. For more about the differences, see the April 2021 talk by Facebook at PrestoCon Day, which describes what they, along with others like Ahana, are doing to push the technology forward.

Trino is a distributed SQL query engine that is best used for running interactive analytic workloads on your data lakes and data sources. It is used for similar use cases to those the original Presto project was designed for. It allows you to query many different data sources, whether HDFS, Postgres, MySQL, Elasticsearch, or an S3-based data lake. Trino is built in Java and can also integrate with other third-party data sources or infrastructure components.

Trino SQL

After the query is parsed, Trino processes the workload into multiple stages across workers. Computing is done in-memory with staged pipelines.

To make Trino extensible to any data source, it was designed with a storage abstraction that makes it easy to build pluggable connectors. Because of this, Trino has a lot of connectors, including connectors to non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, as well as relational sources such as MySQL, PostgreSQL, Amazon Redshift, and Microsoft SQL Server. Like the original community-driven open source Presto project, the data is queried where it is stored, without the need to move it into a separate analytics system.

Want to hear more about Ahana – the easiest Presto managed service ever made? Learn more about Ahana Cloud.

What is Amazon Redshift Used For?

Introduction

Amazon Redshift is one of the most widely-used services in the AWS ecosystem, and is a familiar component in many cloud architectures. In this article, we’ll cover the key facts you need to know about this cloud data warehouse, and the use cases it is best suited for. We’ll also discuss the limitations and scenarios where you might want to consider alternatives.

What is Amazon Redshift?

Amazon Redshift is a fully managed cloud data warehouse offered by AWS, first introduced in 2012. Today Redshift is used by thousands of customers, typically for workloads ranging from hundreds of gigabytes to petabytes of data.

Redshift is based on PostgreSQL 8.0.2 and supports standard SQL for database operations. Under the hood, various optimizations are implemented to provide fast performance even at larger data scales, including massively parallel processing (MPP) and read-optimized columnar storage.

What is a Redshift Cluster?

A Redshift cluster represents a group of nodes provisioned as resources for a specific data warehouse. Each cluster consists of a leader node and compute nodes. When a query is executed, Redshift’s MPP design automatically distributes the processing needed to return the results of an SQL query across the available nodes.

Cluster size depends on the amount of data stored in your database, the number of queries being executed, and the desired performance.

Scaling and managing clusters can be done through the Redshift console, the AWS CLI, or programmatically through the Redshift Query API.

What Makes Redshift Unique?

When Redshift was first launched, it represented a true paradigm shift from traditional data warehouses provided by the likes of Oracle and Teradata. As a fully managed service, Redshift allowed development teams to shift their focus away from infrastructure and toward core application development. The ability to add compute resources automatically with just a few clicks or lines of code, rather than having to set up and configure hardware, was revolutionary and allowed for much faster application development cycles.

Today, many modern cloud data warehouses offer similar linear scaling and infrastructure-as-a-service functionality; a few notable products include Snowflake and Google BigQuery. However, Redshift remains a very popular choice and is tightly integrated with other services in the AWS cloud ecosystem.

Amazon continues to improve Redshift, and in recent years has introduced federated query capabilities, serverless, and AQUA (hardware accelerated cache).

Redshift Use Cases

Redshift’s Postgres roots mean it is optimized for online analytical processing (OLAP) and business intelligence (BI) – typically executing complex SQL queries on large volumes of data rather than transactional processing which focuses on efficiently retrieving and manipulating a single row.

Some common use cases for Redshift include:

  • Enterprise data warehouse: Even smaller organizations often work with data from multiple sources such as advertising, CRM, and customer support. Redshift can be used as a centralized repository that stores data from different sources in a unified schema and structure to create a single source of truth. This can then feed enterprise-wide reporting and analytics.
  • BI and analytics: Redshift’s fast query execution against terabyte-scale data makes it an excellent choice for business intelligence use cases. Redshift is often used as the underlying database for BI tools such as Tableau (which otherwise might struggle to perform when querying or joining larger datasets).
  • Embedded analytics and analytics as a service: Some organizations might choose to monetize the data they collect by exposing it to customers. Redshift’s data sharing, search, and aggregation capabilities make it viable for these scenarios, as it allows exposing only relevant subsets of data per customer while ensuring other databases, tables, or rows remain secure and private.
  • Production workloads: Redshift’s performance is consistent and predictable, as long as the cluster is adequately resourced. This makes it a popular choice for data-driven applications, which might use data for reporting or perform calculations on it.
  • Change data capture and database migration: AWS Database Migration Service (DMS) can be used to replicate changes in an operational data store into Amazon Redshift. This is typically done to provide more flexible analytical capabilities, or when migrating from legacy data warehouses.

Redshift Challenges and Limitations 

While Amazon Redshift is a powerful and versatile data warehouse, it still suffers from the limitations of any relational database, including:

  • Costs: Because storage and compute are coupled, Redshift costs can quickly grow very high. This is especially true when working with larger datasets, or with streaming sources such as application logs.
  • Complex data ingestion: Unlike Amazon S3, Redshift does not support unstructured object storage. Data needs to be stored in tables with predefined schemas. This can often require complex ETL or ELT processes to be performed when data is written to Redshift. 
  • Access to historical data: Due to the above limiting factors, most organizations choose to store only a subset of raw data in Redshift, or limit the number of historical versions of the data that they retain. 
  • Vendor lock-in: Migrating data between relational databases is always a challenge due to the rigid schema and file formats used by each vendor. This can create significant vendor lock-in and make it difficult to use other tools to analyze or access data.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced Redshift use cases, or managing increasing cloud storage costs, check out Ahana. Ahana Cloud is a powerful managed service for Presto which provides SQL on S3. 

Start Running SQL on your Data Lakehouse

Go from 0 to Presto in 30 minutes and drop the limitations of the data warehouse

Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS. Learn more about what it is and how it differs from traditional data warehouses.

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

There can be some confusion with the difference between AWS Redshift Spectrum and AWS Athena. Learn more about the differences in this article.

ETL process diagram

ETL and ELT in Data Warehousing

What is the difference between ETL and ELT?

What is ETL used for?

If you’re looking to understand the differences between ETL and ELT, let’s start by explaining what they are. ETL, or Extract, Transform, Load, is when an ETL tool or a series of homegrown programs extracts data from one or more data sources, often relational databases, and performs transformation functions. Those transformations could be data cleansing, standardization, enrichment, etc. The transformed data is then written (loaded) into a new repository, often a data warehouse.

In other words, in the ETL process an ETL tool or series of programs extracts the data from different RDBMS source systems, transforms it by applying calculations, concatenations, and the like, and then loads it into the data warehouse system.

What is ETL

What is ELT used for?

ELT, or Extract, Load, Transform, turns the ETL process around a bit: you extract the raw data from the source and load it directly into the destination, without any processing in between. The transformation is then done “in place” in the destination repository. Generally, the raw data is stored indefinitely, so various transformations and enrichments can be done later by users with access to it, using tools they are familiar with.

What is ELT?

Both are data integration styles that share much the same ultimate goal but are implemented very differently. Now that we know what they are, let’s dive deeper into how the ETL and ELT processes differ from one another.
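The contrast can be sketched in a few lines of Python. This is only a toy illustration, not a real pipeline: the field names and the list-backed “warehouse” and “lake” are hypothetical, and the point is simply where the transform step sits relative to the load.

```python
# Hypothetical source rows containing a sensitive identifying field.
raw_rows = [
    {"patient": "p-101", "cancer_type": "melanoma", "region": "NW"},
    {"patient": "p-102", "cancer_type": "lymphoma", "region": "SE"},
]

def extract():
    return list(raw_rows)

def transform(rows):
    # Scrub the identifying field, keeping only analytic attributes.
    return [{k: v for k, v in row.items() if k != "patient"} for row in rows]

def load(rows, destination):
    destination.extend(rows)

# ETL: transform runs between extract and load, so only scrubbed
# data ever reaches the warehouse.
warehouse = []
load(transform(extract()), warehouse)

# ELT: raw data is loaded as-is; transformation happens later,
# "in place", against the stored raw copy.
lake = []
load(extract(), lake)
curated = transform(lake)
```

In the ETL version the raw rows never touch the warehouse; in the ELT version the raw copy stays in the lake, so other transformations can be run against it later.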

What is the difference between ETL and ELT?

So how does ETL vs ELT break down?

| Category | ETL | ELT |
| --- | --- | --- |
| Definition | Data is extracted from any number of data sources, transformed in a separate process, then loaded into the destination repository. | Data is extracted from any number of data sources and loaded directly into the destination repository. Transformation occurs inside the destination. |
| Transformation | Data is transformed within an intermediate processing step that is independent of extract and load. | Data can be transformed on an ad-hoc basis during reads, or in batch and stored in another set of tables. |
| Code-based transformations | Primarily executed in the compute-intensive transformation process. | Primarily executed in the database, but also done ad hoc through analysis tools. |
| Data lake support | Only in the sense that the lake can be used as storage for the transformation step. | Well suited to the data lake. |
| Cost | Specialized servers for transformation can add significant costs. | Object stores are very inexpensive, requiring no specialized servers. |
| Maintenance | Additional servers add to the overall maintenance burden. | Fewer systems mean less to maintain. |
| Loading | Data has to be transformed prior to loading. | Data is loaded directly into the destination system. |
| Maturity | ETL tools and methods have been around for decades and are well understood. | Relatively new on the scene, with emerging standards and less experience. |

Use Cases

Let’s take HIPAA as an example of data that would lend itself to ETL rather than ELT. The raw HIPAA data contains a lot of sensitive information about patients that isn’t allowed to be shared, so you would need to go through the transformation process prior to loading it to remove any of that sensitive information. Say your analysts were trying to track cancer treatments for different types of cancer across a geographic region. You would scrub your data down in the transformation process to include treatment dates, location, cancer type, age, gender, etc., but remove any identifying information about the patient.

An ELT approach makes more sense with a data lake, where you have lots of structured, semi-structured, and unstructured data. This can also include high-velocity data where you are trying to make decisions in near real time. Consider an MMORPG where you want to offer incentives to players in a particular region who have performed a particular task. That data is probably coming in through a streaming platform such as Kafka, and analysts are running transformation jobs on the fly to distill it down to the information needed to fuel the desired action.

Differences between ETL and ELT

Summary

In summary, the difference between ETL and ELT in data warehousing really comes down to how you are going to use the data as illustrated above. They satisfy very different use cases and require thoughtful planning and a good understanding of your environment and goals. If you’re exploring whether to use a data warehouse or a data lake, we have some resources that might be helpful. Check out our white paper on Unlocking the Business Value of the Data Lake which discusses the data lake approach in comparison to the data warehouse. 

Ready to Modernize Your Data Stack?

In this free whitepaper you’ll learn what the open data lakehouse is, and how it overcomes challenges of previous solutions. Get the key to unlocking lakehouse analytics.

Related Articles

5 Components of Data Warehouse Architecture

In this article we’ll look at the contextual requirements of a data warehouse, which are the five components of a data warehouse.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository that is typically used for analytic systems and Business Intelligence tools. Learn more about it in this article.


What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

Before we begin: Redshift Spectrum vs Redshift

While the thrust of this article is an AWS Redshift Spectrum vs Athena comparison, there can be some confusion with the difference between AWS Redshift Spectrum and AWS Redshift. Very briefly, Redshift is the storage layer/data warehouse. Redshift Spectrum, on the other hand, is an extension to Redshift that is a query engine.

Redshift spectrum

What is Amazon Athena?

Athena is Amazon’s standalone, serverless SQL query engine implementation of Presto, used to query data stored on Amazon S3. It is fully managed by Amazon; there is nothing to set up, manage, or configure. This also means that performance can be very inconsistent, as you have no dedicated compute resources.

What is Amazon Redshift Spectrum?

Redshift Spectrum is an extension of Amazon Redshift. It is a serverless query engine that can query both AWS S3 data and tabular data in Redshift using SQL. This enables you to join data stored in external object stores with data stored in Redshift to perform more advanced queries.

Key Features & Differences: Redshift Spectrum vs Athena

Athena and Redshift Spectrum offer similar functionality: serverless querying of S3 data using SQL. That makes them easy to manage and cost-effective, since there is nothing to set up and you are charged only for the amount of data scanned. S3 storage is also significantly less expensive than a database on AWS for the same amount of data.

  • Pooled vs allocated resources: Both are serverless, however Spectrum resources are allocated based on your Redshift cluster size. Athena, however, relies on non-dedicated, pooled resources.
  • Cluster management: Spectrum requires a Redshift cluster, so a bit of cluster management is involved; Athena is truly serverless.
  • Performance: Performance for Athena depends on your S3 optimization, while Spectrum, as previously noted, depends on your Redshift cluster resources and S3 optimization. If you need a specific query to run more quickly, then you can allocate additional compute resources to it.
  • Standalone vs feature: Redshift Spectrum runs in tandem with Amazon Redshift, while Athena is a standalone query engine for querying data stored in Amazon S3.
  • Consistency: Spectrum provides more consistency in query performance while Athena has inconsistent results due to the pooled resources.
  • Query types: Athena is great for simpler interactive queries, while Spectrum is more oriented towards large, complex queries.
  • Pricing: The cost for both is the same. They run $5 per compressed terabyte scanned, however with Spectrum, you must also consider the Redshift compute costs.
  • Schema management: Both use AWS Glue for schema management, and while Athena is designed to work directly with Glue, Spectrum needs external tables to be configured for each Glue catalog schema.
  • Federated query capabilities: Both support federated queries.

Athena vs Redshift: Functionality

The functionality of each is very similar: both use standard SQL to query the S3 object store. If you are working with Redshift, Spectrum can join information in S3 directly with tables stored in Redshift. Athena also has a Redshift connector that allows similar joins, but if you are already using Redshift, Spectrum likely makes more sense.

Athena vs Redshift: Integrations

Keep in mind that S3 objects are not traditional databases, which means there are no indexes to be scanned or used for joins. If you are joining high-cardinality files, you will likely see very poor performance.

When connecting to data sources other than S3, Athena has a connector ecosystem to work with. This system provides a collection of sources that you can directly query with no copy required. Federated queries were added to Spectrum in 2020 and provide a similar capability with the added benefit of being able to perform transformations on the data and load it directly into Redshift tables.

AWS Athena vs Redshift: To Summarize

If you are already using Redshift, then Spectrum makes a lot of sense; but if you are just getting started with the cloud, the Redshift ecosystem is likely overkill. AWS Athena is a good place to start if you are new to the cloud and want to test the waters at low cost and minimal effort. Athena, however, quickly runs into challenges with regard to limits, concurrency, transparency, and consistent performance. You can find more details here. Costs will also increase significantly as the scanned data volume grows.

At Ahana, many of our customers are former Athena and/or Redshift users who ran into challenges around price-performance (Redshift) and concurrency/deployment control (Athena). Keep in mind that Athena and Redshift Spectrum share the same $5 per terabyte scanned cost, while Ahana is priced purely on instance hours and gives you the power of Presto, ease of setup and management, strong price-performance, and dedicated compute resources.
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

Ahana PAYGO Pricing

Ahana Cloud is easy to use, fully-integrated, and cloud native. Only pay for what you use with a pay-as-you-go model and no upfront costs.


AWS Redshift Query Limits

At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.


Understanding AWS Athena Costs with Examples

What Is Amazon Athena? 

Since you’re reading this to understand Athena costs, you likely already know what it is, so we’ll only touch on it briefly. Amazon Athena is a managed, serverless version of Presto. It provides a SQL query engine for analyzing unstructured data in AWS S3. Because there are no dedicated resources behind the service, performance can be inconsistent, so Athena is best suited to cases where reliable speed and scalability are not critical: testing ideas, small use cases, and quick ad-hoc analysis.

How Much Does AWS Athena Cost?

An Athena query costs from $5 to $7 per terabyte scanned, depending on the region. Most materials quote only the $5 figure, but some regions cost $7, so keep that in mind. For our examples, we’ll use $5 per terabyte as our base. There is no charge for failed queries, but other charges, such as S3 storage, apply as usual for any service you are using.

AWS Athena Pricing Example

In this example, we have a screenshot from the Amazon Athena pricing calculator where we are assuming one query per workday (about 20 queries a month), each scanning 4 TB of data. The cost per query works out as follows:

$5 per TB scanned * 4 TB scanned = $20 per query

So if we are doing that query 20 times per month, then we have 20 * $20 = $400 per month 

Service settings_Amazon Athena

You can mitigate these costs by storing your data compressed, if that is an option for you. A very conservative 2:1 compression ratio would cut your costs in half, to just $200 per month. If you also store your data in a columnar format like ORC or Parquet, you can reduce costs further by scanning only the columns you need instead of the entire row every time. Applying the same 50% reduction, we now scan only half the data, and the cost drops to $100 per month.

Let’s try a larger example, and not even an unrealistically big one if you are using a data lake and doing serious processing. Say you run 20 queries per day against 100 TB of uncompressed, row-based data:

pricing calculator

That’s right, $304,000 per month. Twenty queries per day isn’t even unrealistic if a few departments run dashboard queries to get updates on various metrics.
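As a sanity check, the arithmetic behind both examples fits in a short Python sketch. The $5/TB rate, the 2:1 compression ratio, and the 50% columnar-pruning factor are the same illustrative assumptions used above:

```python
PRICE_PER_TB = 5.0  # USD per TB scanned; some regions charge $7

def monthly_cost(queries_per_month, tb_scanned_per_query, price_per_tb=PRICE_PER_TB):
    """Athena bills per TB scanned; failed queries are free and ignored here."""
    return queries_per_month * tb_scanned_per_query * price_per_tb

# Small example: 20 queries/month, each scanning 4 TB.
base = monthly_cost(20, 4)                  # $400
compressed = monthly_cost(20, 4 * 0.5)      # 2:1 compression: $200
columnar = monthly_cost(20, 4 * 0.5 * 0.5)  # plus column pruning: $100

# Large example: 20 queries/day (about 608/month) over 100 TB of raw data.
large = monthly_cost(20 * 30.4, 100)        # $304,000
```

Plugging your own query counts and scan sizes into a calculation like this is a quick way to estimate a monthly bill before turning users loose on Athena.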

Summary

We’ve covered the details of Athena pricing, but we also saw how easy it would be to get hit with a giant bill unexpectedly. If you haven’t compressed your data or reformatted it to reduce costs, and have just dumped a bunch of CSV or JSON files into S3, you can be in for a nasty surprise. The same goes for giving your data consumers uncontrolled access to Athena: if they fire off a lot of queries over a lot of data, costs add up fast. Fortunately, it isn’t hard to figure out what the cost will be for a specific usage pattern, and Amazon has provided the tools to do it.

If you’re an Athena user who’s not happy with costs, you’re not alone. We see many Athena users wanting more control over their deployment and in turn, costs. That’s where we can help – Ahana is SaaS for Presto (the same technology that Athena is running) that gives you more control over your deployment. Typically our customers see up to 5.5X price performance improvements on their queries as compared to Athena. 

You can learn more about how Ahana compares to AWS Athena in this comparison page.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article


5 Components of Data Warehouse Architecture

Data Warehouse Architecture

Tip: If you are struggling to get value from your data warehouse due to vendor lock-in or handling semi and unstructured data, set up a time to chat with an engineer about migrating to a Data Lakehouse.

What are the components of a data warehouse?

Most data warehouses will be built around a relational database system, either on-premise or in the cloud, where data is both stored and processed. Other components would include metadata management and an API connectivity layer allowing the warehouse to pull data from organizational sources and provide access to analytics and visualization tools.

A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.

The data warehouse has been around for decades. Born in the 1980s, it addressed the need for optimized analytics on data. As companies’ business applications began to grow and generate/store more data, they needed a system that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse (this process is also known as ETL – Extract, Transform, Load).

The schema is made up of metadata (data about the data), so users can easily find what they’re looking for. The data warehouse could also connect to many different data sources, so it became an easier way to manage all of a company’s data for analysis.

As data warehouse architecture evolved and grew in popularity, more people within a company started using it to access data – and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

To learn more about the internal architecture of a data warehouse and its various components such as nodes and clusters, check out our previous article on Redshift data warehouse architecture.

Here’s a quick high level overview of the data warehouse architecture:

Data Warehouse Architecture
In this article we’ll look at the contextual requirements of data warehouse architecture, and the five components of a data warehouse. 

The 5 components of a data warehouse architecture are:
  1. ETL
  2. Metadata
  3. SQL Query Processing
  4. Data layer
  5. Governance/security

ETL 

As mentioned above, ETL stands for Extract, Transform, Load. When DBAs want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed/queried/etc. For the purposes of this article, I won’t go into too much detail of how the entire ETL process works, but there are many different resources where you can learn about ETL.

Metadata

Metadata is data about data. It describes all of the data stored in a system to make it searchable. Examples include the author, date, or location of an article, or a file’s creation date and size. Think of it like the column titles in a spreadsheet. Metadata lets you organize your data and make it usable, so you can analyze it to create dashboards and reports.
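To make “data about data” concrete, here is a tiny Python sketch that records two metadata fields for a file. The file name and contents are hypothetical; the point is that the metadata describes the file rather than the rows inside it:

```python
import os
import time

# Hypothetical data file.
path = "patients.csv"
with open(path, "wb") as f:
    f.write(b"name,age\nSam,45\n")

# Metadata: facts *about* the data, not the data itself.
info = os.stat(path)
metadata = {
    "file_name": path,
    "size_bytes": info.st_size,             # how much data there is
    "modified": time.ctime(info.st_mtime),  # when it last changed
}

os.remove(path)  # clean up the example file
```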

SQL Query Processing

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.

Data Layer

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your Sales team access to your HR team’s data, and vice versa.

Governance/Security

This is related to the data layer in that you need to be able to provide fine grained access and security policies across all of your organization’s data. Typically data warehouses have very good governance and security capabilities built in, so you don’t need to do a lot of custom engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.

Challenges with a Data Warehouse Architecture

Now that I’ve laid out the five key components of a data warehouse architecture, let’s discuss some of the challenges of the data warehouse. As companies start housing more data and needing more advanced analytics and a wide range of data, the data warehouse starts to become expensive and not so flexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work. 

We’re seeing more companies moving to the Data Lakehouse architecture, which helps to address the above. The Open Data Lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. Instead of a tightly coupled system, the Data Lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data, and more. Here’s what that architecture looks like:

Data Warehouse

The Data Lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads. If you are looking to upgrade from data warehouse architecture, then developing an Open Data Lakehouse is the way to go.


If you’re interested in learning more about why companies are moving from the data warehouse to the data lakehouse, check out this free whitepaper on how to Unlock the Business Value of the Data Lake/Data Lakehouse.


Related Articles

Data Warehouse: A Comprehensive Guide

Data warehouse architecture is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understand of what it is and how it’s used.

Data Warehouse Concepts for Beginners

A relational database that is designed for query and analysis rather than for transaction processing. Learn more here.

What is a Data Lakehouse Architecture?

Overview

The term Data Lakehouse has become very popular over the last year or so, especially as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse architecture is, and how companies are using the Data Lakehouse in production today. Finally, we’ll share a bit on where Ahana Cloud for Presto fits into this architecture and how real companies are leveraging Ahana as the query engine for their Data Lakehouse.

What is a Data Lakehouse?

First, it’s best to explain a Data Warehouse and a Data Lake.

Data Warehouse

A data warehouse is one central place where you can store specific, structured data. Most of the time that’s relational data that comes from transactional systems, business apps, and operational databases. You can run fast analytics on the data warehouse with very good price/performance. However, using a data warehouse typically means you’re locked into its proprietary formats: the trade-off for the speed and price/performance is that your data is ingested into, and locked inside, that warehouse, so you lose the flexibility of a more open solution.

Data Lake

On the other hand, a Data Lake is one central place where you can store any kind of data you want – structured, unstructured, etc. – at scale. Popular data lake stores are AWS S3, Microsoft Azure Blob Storage, and Google Cloud Storage. Data lakes are widely popular because they are very cheap and easy to use: you can store a virtually unlimited amount of any kind of data at a very low cost. However, the data lake doesn’t provide built-in mechanisms for query, analytics, and so on. You need a query engine and a data catalog on top of the data lake to query your data and make use of it (that’s where Ahana Cloud comes in, but more on that later).

Data Lakehouse explained_diagram

Data Lakehouse

Now let’s look at the Data Lake vs the Lakehouse. A new data lakehouse architecture has emerged that takes the best of the Data Warehouse and the Data Lake: it’s open, flexible, offers good price/performance, and scales like a data lake, while supporting transactions and strong security like a data warehouse.

Data Lakehouse Architecture Explained

Here’s an example of a Data Lakehouse architecture:

An example of a Data Lakehouse architecture

You’ll see the key components include your Cloud Data Lake, your catalog & governance layer, and the data processing (SQL query engine). On top of that you can run your BI, ML, Reporting, and Data Science tools. 

There are a few key characteristics of the Data Lakehouse. First, it’s based on open data formats – think ORC, Parquet, etc. That means you’re not locked into a proprietary format and can use an open source query engine to analyze your data. Your lakehouse data can be easily queried with SQL engines.

Second, a governance/security layer on top of the data lake is important to provide fine-grained access control to data. Last, performance is critical in the Data Lakehouse. To compete with data warehouse workloads, the data lakehouse needs a high-performing SQL query engine on top. That’s where open source Presto comes in, which can provide that extreme performance to give you similar, if not better, price/performance for your queries.

Building your Data Lakehouse with Ahana Cloud for Presto

At the heart of the Data Lakehouse is your high-performance SQL query engine. That’s what enables you to get high performance analytics on your data lake data. Ahana Cloud for Presto is SaaS for Presto on AWS, a really easy way to get up and running with Presto in the cloud (it takes under an hour). This is what your Data Lakehouse architecture would look like if you were using Ahana Cloud:

Building your Data Lakehouse with Ahana Cloud for Presto_diagram

Ahana comes built-in with a data catalog and caching for your S3-based data lake. With Ahana you get the capabilities of Presto without having to manage the overhead – Ahana takes care of it for you under the hood. The stack also includes and integrates with transaction managers like Apache Hudi, Delta Lake, and AWS Lake Formation.

We shared more on how to unlock your data lake with Ahana Cloud in the data lakehouse stack in a free on-demand webinar.

Ready to start building your Data Lakehouse? Try it out with Ahana. We have a 14-day free trial (no credit card required), and in under 1 hour you’ll have SQL running on your S3 data lake.

What is an Open Data Lake in the Cloud?

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses.


Enterprise Data Lake Formation & Architecture on AWS

What is Enterprise Data Lake

An enterprise data lake is simply a data lake used for enterprise-wide sharing and storage of data. Its key purpose is to support analytics on that data to unlock business insights.

Why AWS Lake formation for Enterprise Data Lake

Because the key purpose of an enterprise data lake is to run analytics and gain business insights, governance becomes important for securing access to data across the different roles in the enterprise. AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for data lakes on AWS S3.

Enterprise Data Lake Formation & Architecture

Enterprise data platforms need a simpler, scalable, and centralized way to define and enforce access policies on their data lakes: a policy-based approach that allows data lake consumers to use the analytics service of their choice, best suited to the operations they want to perform on the data. Using Amazon S3 bucket policies to manage access control is an option, but as the number of combinations of access levels and users increases, it becomes impractical for enterprises.

Data Lake Storage diagram

AWS Lake Formation allows enterprises to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). 

Benefits of Lake formation for Enterprise Data Lakes

  •  One schema, shareable with no dependency on architecture
  •  Lake Formation databases and tables can be shared with any Amazon account
  •  No Amazon S3 policy edits required
  •  Recipients of the data can use an analytics service provider like Ahana to run analytics
  •  No dependency between roles on how the data will be further shared
  •  Centralized logging

AWS Enterprise Lake Formation: To Summarize

AWS Lake Formation is integrated with AWS partners like Ahana Cloud, a managed service for SQL on data lakes. These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for their data lakes.

Related Articles

The Role of Blueprints in Lake Formation

A Lake Formation Blueprint allows you to easily stamp out and create workflows. Learn more in this article.

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become very popular over the last year or so. Learn more about what it is and how it’s used.


How to Query Your JSON Data Using Amazon Athena

AWS Athena is Amazon’s serverless implementation of Presto, which means they generally have the same features. A popular use case is to use Athena to query Parquet, ORC, CSV and JSON files, either directly or after they have been transformed and loaded into a data warehouse. Athena allows you to extract data from, search for values in, and parse JSON data.



Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Toyota"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": {
        "make": "Ford"
      },
      "car2": {
        "make": "SAAB"
      },
      "car3": {
        "make": "Kia"
      }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": {
        "make": "Honda"
      },
      "car2": {
        "make": "Porsche"
      },
      "car3": {
        "make": "Kia"
      }
    }
  }
]

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table;

 name  | age | cars.car2.make
-------+-----+----------------
 Sam   | 45  | Toyota
 Sally | 21  | SAAB
 Bill  | 68  | Porsche

That is a pretty simple use case of retrieving certain fields out of the JSON. The complexity was the cars column with the key/value pairs, where we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values” or “array of key=value” expressions. We’ll illustrate the latter two next.

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
	"dealership": "Family Honda",
	"models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot"]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) AS t(cars);

 dealership   | cars
--------------+---------
 Family Honda | Civic
 Family Honda | Accord
 Family Honda | Odyssey
 Family Honda | Brio
 Family Honda | Pilot

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:

select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee_unnested);

 dealership   | employee_unnested
--------------+------------------------------------
 Family Honda | {name=Allan, dept=service, age=45}
 Family Honda | {name=Bill, dept=sales, age=52}
 Family Honda | {name=Karen, dept=finance, age=32}
 Family Honda | {name=Terry, dept=admin, age=27}

By using the “.key”, we can now retrieve a specific column:

select dealership, employee_unnested.name, employee_unnested.dept, employee_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested);

 dealership   | name  | dept    | age
--------------+-------+---------+-----
 Family Honda | Allan | service | 45
 Family Honda | Bill  | sales   | 52
 Family Honda | Karen | finance | 32
 Family Honda | Terry | admin   | 27
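
You can also filter on the unnested fields directly. A minimal sketch, using the same hypothetical dataset table, to keep only the sales employees:

```sql
-- Filter unnested rows by a nested field (illustrative, same hypothetical table).
SELECT dealership, employee_unnested.name
FROM dataset
CROSS JOIN UNNEST(dataset.employee) AS t(employee_unnested)
WHERE employee_unnested.dept = 'sales';
```

The filter is applied after the unnest, so each array element is evaluated as its own row.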

Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous AWS Athena users that saw challenges around price performance and concurrency/deployment control. Keep in mind, Athena costs from $5 to around $7 per terabyte scanned, depending on the region. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 


You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

Related Articles

What is Presto?

Take a deep dive into Presto: what it is, how it started, and the benefits.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.


AWS Athena vs AWS Glue: What Are The Differences?

Amazon’s AWS platform has over 200 products and services, which can make understanding what each one does and how they relate confusing. Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

What is AWS Athena?

AWS Athena is a serverless implementation of Presto, an interactive query engine that allows you to query structured or unstructured data straight out of S3 buckets.

What is AWS Glue?

AWS Glue is also serverless, but more of an ecosystem of tools to allow you to easily do schema discovery and ETL with auto-generated scripts that can be modified either visually or via editing the script. The most commonly known components of Glue are Glue Metastore and Glue ETL. Glue Metastore is a serverless hive compatible metastore which can be used in lieu of your own managed Hive. Glue ETL on the other hand is a Spark service which allows customers to run Spark jobs without worrying about the configuration, manageability and operationalization of the underlying Spark infrastructure. There are other services such as Glue Data Wrangler which we will keep outside the scope of this discussion.

AWS Athena vs AWS Glue

Where this turns from AWS Glue vs AWS Athena to AWS Glue working with Athena is with the Glue Catalog. The Glue catalog is used as a central hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services – Glue ETL, Athena, EMR, Lake formation, AI/ML etc. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics and Glue is more of a transformation and data movement tool.

Some examples of how Glue and Athena can work together would be:

  • Creating tables for Glue to use in ETL jobs. The table must have a property added to it called a classification, which identifies the format of the data. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena would be:

CREATE EXTERNAL TABLE sampleTable (
  col1 INT,
  col2 INT,
  str1 STRING
) STORED AS AVRO
TBLPROPERTIES (
  'classification'='avro');

  • Transforming data into a format that is better optimized for query performance in Athena, which will also impact cost. For example, converting a CSV or JSON file into Parquet.
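
The conversion described in the last bullet can be done in Athena itself with a CTAS (CREATE TABLE AS SELECT) statement. A sketch, where the table names and S3 location are hypothetical:

```sql
-- Illustrative CTAS: write a CSV-backed table out as Parquet.
-- 'csv_table', 'parquet_table', and the bucket path are placeholder names.
CREATE TABLE parquet_table
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet-data/'
) AS
SELECT * FROM csv_table;
```

Queries against the new Parquet-backed table scan fewer bytes for the same result, which is where both the speed and cost benefits come from.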

Query S3 Using Athena & Glue

Now how about querying S3 data utilizing both Athena and Glue? There are a few steps to set it up, first, we’ll assume a simple CSV file with IoT data in it, such as:

(Sample CSV table with IoT data)

We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’, the query looks like this:

SELECT * FROM my_table WHERE att2 = 'Z';

From here, you can perform any query you want, you can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. You are insulated from the details of the backend as a data analyst using Athena, while the data engineers can optimize the source data for speed and cost using Glue.

AWS Athena is a great place to start if you are just getting started on the cloud and want to test the waters at low cost and minimal effort. Athena however quickly runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs will increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous Athena users that saw challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte scanned. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 

You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/ 


Querying Parquet Files using Amazon Athena

Parquet is one of the latest file formats with many advantages over some of the more commonly used formats like CSV and JSON. Specifically, Parquet’s speed and efficiency of storing large volumes of data in a columnar format are big advantages that have made it more widely used. It supports many optimizations and stores metadata around its internal contents to support fast lookups and searches by modern distributed querying/compute engines like PrestoDB, Spark, Drill, etc. Here are steps to quickly get set up to query your parquet files with a service like Amazon Athena.

How to query Parquet files

Parquet is a columnar storage format that is optimized for analytical querying. Data warehouses such as Redshift support Parquet for optimized performance. Parquet files stored on Amazon S3 can be queried directly by serverless query engines such as Amazon Athena or open source Presto using regular SQL.

Below we show a simple example of running such a query.

Prerequisites

  • Sample Parquet Data –  https://ahana-labs.s3.amazonaws.com/movielens/ratings/ratings.parquet
  • AWS Account and Role with access to below services:
    • AWS S3
    • AWS Glue (Optional but highly recommended)
    • AWS Athena

Setting up the Storage

For this example we will be querying the parquet files from AWS S3. To do this, we must first upload the sample data to an S3 bucket. 

Log in to your AWS account and select the S3 service in the Amazon Console.

  1. Click on Create Bucket
  2. Choose a name that is unique. For this example I chose ‘athena-parquet-<your-initials>’. S3 is a global service so try to include a unique identifier so that you don’t choose a bucket that has already been created. 
  3. Scroll to the bottom and click Create Bucket
  4. Click on your newly created bucket
  5. Create a folder in the S3 bucket called ‘test-data’
  6. Click on the newly created folder
  7. Choose Upload Data and upload your parquet file(s).

Running a Glue Crawler

Now that the data is in S3, we need to define the metadata for the file. This can be tedious and involve using a different reader program to read the parquet file to understand the various column field names and types. Thankfully, AWS Glue provides a service that can scan the file and fill in the requisite metadata auto-magically. To do this, first navigate to the AWS Glue service in the AWS Console.

  1. On the AWS Glue main page, select ‘Crawlers’ from the left hand side column
  2. Click Add Crawler
  3. Pick a name for the crawler. For this demo I chose to use ‘athena-parquet-crawler’. Then choose Next.
  4. In Crawler Source Type, leave the settings as is (‘Data Stores’ and ‘Crawl all folders’) and choose Next.
  5. In Data Store under Include Path, type in the URL of your S3 bucket. It should be something like ‘s3://athena-parquet-<your-initials>/test-data/’.
  6. In IAM Role, choose Create an IAM Role and fill the suffix with something like ‘athena-parquet’. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket.
  7. For Frequency leave the setting as default and choose Next
  8. For Output, choose Add Database and create a database with the name ‘athena-parquet’. Then choose Next.
  9. Review and then choose Finish.
  10. AWS will prompt you if you would like to run the crawler. Choose Run it now or manually run the crawler by refreshing the page and selecting the crawler and choosing the action Run.
  11. Wait for the crawler to finish running. You should see the number 1 in the column Tables Added for the crawler.

Querying the Parquet file from AWS Athena

Now that the data and the metadata are created, we can use AWS Athena to query the parquet file. Choose the Athena service in the AWS Console.

  1. Choose Explore the Query Editor and it will take you to a page where you should immediately be able to see a UI like this:
Query Editor
  2. Before you can proceed, Athena will require you to set up a Query Results Location. Select the prompt and set the Query Result Location to ‘s3://athena-parquet-<your-initials>/test-results/’.
  3. Go back to the Editor and type the following statement: ‘SELECT * FROM test_data LIMIT 10;’ The table name will be based on the folder name you chose in the S3 storage step.
  4. The final result should look something like this:
Query Editor v2
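
Beyond SELECT *, you can run regular analytic SQL against the same table. A sketch, assuming the MovieLens ratings file exposes movieid and rating columns (the column names in your crawled schema may differ):

```sql
-- Illustrative aggregate over the crawled Parquet table.
SELECT movieid,
       count(*)    AS num_ratings,
       avg(rating) AS avg_rating
FROM test_data
GROUP BY movieid
ORDER BY num_ratings DESC
LIMIT 10;
```

Because Parquet is columnar, a query like this only reads the two columns it touches rather than the whole file.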

Conclusion

Some of these steps, like using Glue Crawlers, aren’t required but are a better approach for handling Parquet files where the schema definition is unknown. Athena itself is a pretty handy service for getting hands on with the files themselves but it does come with some limitations. 

Those limitations include concurrency limits, price performance impact, and no control of your deployment. Many companies are moving to a managed service approach, which takes care of those issues. Learn more about AWS Athena limitations and why you should consider a managed service like Ahana for your SQL on S3 needs.


The Role of Blueprints in Lake Formation on AWS

Why does this matter?

There are two major steps to creating a Data Lakehouse on AWS: first, set up your S3-based data lake; second, run analytical queries on your data lake. A popular SQL engine you can use is Presto. This article focuses on the first step and how AWS Lake Formation Blueprints can make it easy and automated. Before you can run analytics to get insights, you need your data continuously pooling into your lake!

AWS Lake Formation helps with the time-consuming data wrangling involved in maintaining a data lake, making it simple and secure. Lake Formation includes a Workflows feature; a workflow encompasses a complex set of ETL jobs to load and update data. 

work flow diagram

What is a Blueprint?

A Lake Formation Blueprint allows you to easily stamp out and create workflows. This is an automation capability within Lake Formation. There are 3 types: Database snapshots, incremental database, and log file blueprints.

The database blueprints support automated data ingestion from sources like MySQL, PostgreSQL, and SQL Server into the Open Data Lake. It’s a point-and-click service with simple forms in the AWS console.

A Database snapshot does what it sounds like, it loads all the tables from a JDBC source to your lake. This is good when you want time stamped end-of-period snapshots to compare later.

An Incremental database also does what it sounds like, taking only the new data or the deltas into the data lake. This is faster and keeps the latest data in your data lake. The Incremental database blueprint uses bookmarks on columns for each successive incremental blueprint run. 

The Log file blueprint takes logs from various sources and loads them into the data lake. ELB logs, ALB logs, and CloudTrail logs are examples of popular log files that can be loaded in bulk. 

Summary and how about Ahana Cloud?

Getting data into your data lake is easy, automated, and consistent with AWS Lake Formation. Once you have your data ingested, you can use a managed service like Ahana Cloud for Presto to enable fast queries on your data lake to derive important insights for your users. Ahana Cloud has integrations with AWS Lake Formation governance and security policies. See that page here: https://ahana.io/aws-lake-formation 

lake formation diagram

Presto equivalent of mysql group_concat

As you may know, PrestoDB supports ANSI SQL and includes support for several SQL dialects. These dialects include the MySQL dialect, making it easy to group and aggregate data in a variety of ways. However, not ALL functions in MySQL are supported by PrestoDB! 


Now, let us look at the really useful MySQL and MariaDB SQL function GROUP_CONCAT(). This function is used to concatenate data in column(s) from multiple rows into one field. It is an aggregate (GROUP BY) function which returns a String. This is assuming the group contains at least one non-NULL value (otherwise it returns NULL). GROUP_CONCAT() is an example of a function that is not yet supported by PrestoDB and this is the error you will see if you try using it to get a list of customers that have ordered something along with their order priorities:

presto> use tpch.sf1;

presto:sf1> select custkey, GROUP_CONCAT(DISTINCT orderpriority ORDER BY orderpriority SEPARATOR ',') as OrderPriorities from orders GROUP BY custkey;

Query 20200925_105949_00013_68x9u failed: line 1:16: Function group_concat not registered

Is there a way to handle this? If so, what’s the workaround? There is!

array_join() and array_agg() to the rescue! 

presto:sf1> select custkey,array_join(array_distinct(array_agg(orderpriority)),',') as OrderPriorities from orders group by custkey;
 custkey |                OrderPriorities                 
---------+------------------------------------------------
   69577 | 2-HIGH,1-URGENT,3-MEDIUM,5-LOW,4-NOT SPECIFIED 
   52156 | 4-NOT SPECIFIED,3-MEDIUM,1-URGENT,5-LOW,2-HIGH 
  108388 | 5-LOW,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,1-URGENT 
  111874 | 5-LOW,1-URGENT,2-HIGH,4-NOT SPECIFIED          
  108616 | 1-URGENT,5-LOW,4-NOT SPECIFIED,3-MEDIUM,2-HIGH 
(only the first 5 rows displayed) 

If you do not want to use the DISTINCT operator (in other words, you want duplicates in your result set), there is an easy solution: simply drop the array_distinct() function from your query:

presto:sf1> select custkey,array_join(array_agg(orderpriority),',') as OrderPriorities from orders group by custkey;
 custkey | OrderPriorities                             
---------+-------------------------------------------------------------------------------- 
   24499 | 5-LOW,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,2-HIGH,4-NOT SPECIFIED,3-MEDIUM,1-URGENT,2-HIGH,3-MEDIUM,1-URGENT,5-LOW,3-MEDIUM,4-NOT SPECIFIED,4-NOT SPECIFIED,4-NOT SPECIFIED,3-MEDIUM,3-MEDIUM,5-LOW,1-URGENT,1-URGENT,4-NOT SPECIFIE
   58279 | 4-NOT SPECIFIED,2-HIGH,5-LOW,1-URGENT,1-URGENT,5-LOW,5-LOW,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,5-LOW,3-MEDIUM,1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,5-LOW,5-LOW,3-MEDIUM,3-MEDIUM,1-URGENT,3-MEDIUM,2-HIGH,5-LOW
  142027 | 1-URGENT,2-HIGH,2-HIGH,1-URGENT,3-MEDIUM,1-URGENT,5-LOW,4-NOT SPECIFIED,4-NOT SPECIFIED,2-HIGH,3-MEDIUM,2-HIGH,1-URGENT,3-MEDIUM,5-LOW,3-MEDIUM,4-NOT SPECIFIED,2-HIGH,1-URGENT,5-LOW,2-HIGH,5-LOW,1-URGENT,4-NOT SPECIFIED,2-HIG
   94169 | 1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,5-LOW,4-NOT SPECIFIED,2-HIGH,5-LOW,4-NOT SPECIFIED                                                                                                                                                                                                        
   31607 | 4-NOT SPECIFIED,2-HIGH,4-NOT SPECIFIED,2-HIGH,2-HIGH,5-LOW 

You can, of course, specify the separator character. In the example shown above, I have used a comma as the separator.
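
For instance, a sketch of the same query with ‘ | ’ as the separator instead of a comma:

```sql
-- Same aggregation as above, with a different separator string.
SELECT custkey,
       array_join(array_distinct(array_agg(orderpriority)), ' | ') AS OrderPriorities
FROM orders
GROUP BY custkey;
```

Any string works as the second argument to array_join(), so you can match whatever delimiter your downstream tooling expects.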

It’s worth noting that, like PrestoDB, Microsoft SQL Server didn’t have a T-SQL equivalent of the MySQL GROUP_CONCAT() function either. However, T-SQL now has the STRING_AGG() function, which is available from SQL Server 2017 onwards.
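
For comparison, a sketch of the same aggregation in T-SQL, assuming the same orders table lived in SQL Server 2017 or later:

```sql
-- T-SQL equivalent using STRING_AGG (SQL Server 2017+); table is hypothetical here.
SELECT custkey,
       STRING_AGG(orderpriority, ',') AS OrderPriorities
FROM orders
GROUP BY custkey;
```

Note that STRING_AGG does not deduplicate on its own, so this corresponds to the non-DISTINCT Presto query above.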

And hey, presto, you now have a working Presto equivalent of mysql group_concat.


Understanding the Presto equivalent of mysql group_concat – Now what?

If you are looking for more tips and tricks, here’s your next step. Check out our Answers section to learn more about PrestoDB, competitor reviews and comparisons, and more technical guides and resources to get you started.


Related Articles

What is Presto?

What’s Presto, how did it start, and what is it for? Ready to answer these questions? Take a deeper dive into understanding Presto. Learn what PrestoDB is, how it got started, and the benefits for Presto users.

How to Build a Data Lake Using Lake Formation on AWS

What is AWS Lake Formation? AWS lake formation helps users to build, manage, and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach. Learn more about AWS Lake Formation, including the pros and cons of Amazon Lake Formation.

Data Warehouse: A Comprehensive Guide

Looking to learn more about data warehouses? Start here for a deeper look. This article will cover topics like what it is: a data warehouse is a data repository. Also get more info about their use: typically a warehouse is used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understand of what it is, how it’s used, and the pros and cons of a data warehouse compared to a data lake.


Querying Amazon S3 Data Using AWS Athena

The data lake is becoming increasingly popular for more than just data storage. Now we see much more flexibility with what you can do with the data lake itself – add a query engine on top to get ad hoc analytics, reporting and dashboarding, machine learning, etc. In this article we’ll look more closely at AWS S3 and AWS Athena.

How Does AWS Athena work with Amazon S3

In AWS land, AWS S3 is the de facto data lake. Many AWS users who want to start easily querying that data will use Amazon Athena, a serverless query service that allows you to run ad hoc analytics using SQL on your data. Amazon Athena is built on Presto, the open source SQL query engine that came out of Meta (Facebook) and is now an open source project housed under the Linux Foundation. One of the most popular use cases is to query S3 with Athena.

The good news about Amazon Athena is that it’s really easy to get up and running. You can simply add the service and start running queries on your S3 data lake right away. Because Athena is based on Presto, you can query data in many different formats including JSON, Apache Parquet, Apache ORC, CSV, and a few more. Many companies today use Athena to query S3.

How to query S3 using AWS Athena

The first thing you’ll need to do is create a new bucket in AWS S3 (or you can use an existing one, though for the purposes of testing it out, creating a new bucket is probably helpful). You’ll use Athena to query S3 buckets. Next, open up your AWS Management Console and go to the Athena home page. From there you have a few options for how to create a table; for this example just select the “Create table from S3 bucket data” option. 

From there, AWS has made it fairly easy to get up and running in a quick 4 step process where you’ll define the database, table name, and S3 folder where data for this table will come from. You’ll select the data format, define your columns, and then set up your partitions (this is if you have a lot of data). Briefly laid out:

  1. Set up your Database, Table, and Folder Names & Locations
  2. Choose the data format you’ll be querying
  3. Define your columns so Athena understands your data schema
  4. Set up your Data Partitions if needed

Now you’re ready to start querying with Athena. You can run simple select statements on your data, giving you the ability to run SQL on your data lake.
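
For example, a sketch of a simple ad hoc query, where the database, table, column, and partition names are all hypothetical:

```sql
-- Illustrative ad hoc query against an S3-backed table; all names are placeholders.
SELECT status, count(*) AS events
FROM my_database.my_table
WHERE year = '2022'
GROUP BY status
ORDER BY events DESC;
```

Filtering on a partition column like the hypothetical year keeps the amount of S3 data scanned (and therefore the Athena cost) down.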

What happens when AWS Athena hits its limits

While Athena is very easy to get up and running, it has known limitations that start impacting price performance as usage grows, including query limits, partition limits, inconsistent performance, and some others. It’s actually why we see a lot of previous Athena users move to Ahana Cloud for Presto, our managed service for Presto on AWS. 

Here’s a quick comparison between the two offerings:

AWS Athena replacement

Some of our customers shared why they moved from AWS Athena to Ahana Cloud. Adroitts saw 5.5X price performance improvement, faster queries, and more control after they made the switch, while SIEM leader Securonix saw 3X price performance improvement along with better performing queries.

We can help you benchmark Athena against Ahana Cloud, get in touch with us today and let’s set up a call.

Related Articles


What is an Open Data Lake in the Cloud?

Data-driven insights can help business and product leaders hone in on customer needs and/or find untapped opportunities. Also, analytics dashboards can be presented to customers for added value.

Building an Open Data Lakehouse with Presto, Hudi and AWS S3

Learn how you can start building an Open Data Lake analytics stack using Presto, Hudi and AWS S3 and solve the challenges of a data warehouse

AWS Lake Formation vs AWS Glue – What are the differences?

Last updated: October 2022

As you start building your analytics stack in AWS, there are several AWS technologies to understand as you begin. In this article we’ll discuss two key technologies:

  • AWS Lake Formation, for security and governance; and
  • AWS Glue, a data catalog.

While both of these services are typically used to build, manage, and operationalize AWS data lakes, they fulfil completely different roles. AWS Lake Formation is built around AWS Glue, and both services share the same AWS Glue Data Catalog; however, Lake Formation provides a wider breadth of governance and data management functionality, whereas Glue is focused on ETL and data processing.

What is AWS Lake Formation? 

AWS Lake Formation makes it easier for you to build, secure, and manage data lakes. It provides a means to address some of the challenges around unstructured data lake storage – including security, access control, governance, and performance.

How it works: AWS Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and ML services.

For AWS users who want to get governance on their data lake, AWS Lake Formation makes it easy to set up a secure data lake very quickly (in a matter of days). 

In order to provide better query performance when using services such as Athena or Presto, Lake Formation creates Glue workflows that integrate source tables, extract the data, and load it into the Amazon S3 data lake.

When should you use AWS Lake Formation? 

At its core, Lake Formation is built to simplify the process of moving your data to a data lake, cataloging the data, and making it available for querying. Typical scenarios where this comes into play include:

  • Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  • Add Authorization on your Data Lake  – You can centrally define and enforce security, governance, and auditing policies.
  • Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

To understand how this works in practice, check out our article on using Redshift Spectrum in Lake Formation.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and join data for analytics, machine learning, and application development. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog which discovers and catalogs metadata about your data stores or data lake.  Using the AWS Glue Data Catalog, users can easily find and access data.

Glue ETL can be used to run managed Apache Spark jobs in order to prepare the data for analytics, perform transformations, compact data, and convert it into columnar formats such as Apache Parquet.

Read more: What’s the difference between Athena and Glue?

When should you use AWS Glue?

To make data in your data lake accessible, some type of data catalog is essential. Glue is often the default option as it’s well-integrated into the broader AWS ecosystem, although you could consider open-source alternatives such as Apache Iceberg. Glue ETL is one option to process data, where alternatives might include running your own Spark cluster on Amazon EMR or using Databricks.

Typical scenarios where you might use Glue include:

  • Create a unified data catalog to find data across multiple data stores – View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
  • Data Catalog for data lake analytics with S3 – Organize, cleanse, validate, and format data for storage in a data warehouse or data lake
  • Build ETL pipelines to ingest data into your S3 data lake. 

The data workflows initiated from an AWS Lake Formation blueprint are executed as AWS Glue jobs. You can view and manage these workflows in either the Lake Formation console or the AWS Glue console.

AWS Lake Formation vs AWS Glue: A Summary

AWS Lake Formation simplifies security and governance on the data lake, whereas AWS Glue simplifies metadata and data discovery for data lake analytics. While both of these services are used as data lake building blocks, they are complementary. Glue provides the basic functionality needed to enable analytics, including data cataloging and ETL; Lake Formation offers a simplified way to manage your data lake, including the underlying Glue jobs.

Check out our community roundtable where we discuss how you can build simple data lake with the new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack


Presto Platform Overview

The Presto platform is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. PrestoDB was developed from the ground up by the engineers at Meta. Currently, some of the world’s most well-known, innovative and data-driven companies like Twitter, Uber, Walmart and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook, for example, still uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte per day.

The Presto platform was designed and written from scratch to handle interactive analytics; it approaches the speed of commercial data warehouses while scaling to the size of organizations like Airbnb or Twitter.

Presto allows users to effectively query data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, or even proprietary data stores. A single Presto query can combine data from multiple sources, which allows for quick and accurate analytics across your entire organization. Presto is an in-memory, distributed, parallel system. 

Presto is targeted at data analysts and data scientists who expect response times ranging from sub-second to minutes. The Presto platform breaks the false choice between having fast analytics using an expensive commercial solution or using a slow “free” solution that requires excessive hardware. A single Presto query can combine data from multiple sources. 
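
As a sketch of that federation capability (the catalog, schema, and table names below are hypothetical), a single Presto query can join Hive data on S3 with a MySQL database:

```sql
-- Join orders stored in a Hive/S3 catalog with customers in a MySQL catalog
SELECT o.order_id, c.customer_name
FROM hive.sales.orders AS o
JOIN mysql.crm.customers AS c
  ON o.customer_id = c.id;
```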

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: connectors link Presto to a data source, such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database. 
  • A parser, planner, scheduler, cost-based query optimizer, and execution engine.
  • Drivers for connecting tools, including JDBC, the Presto CLI tool, and the Presto Console. 

In terms of organization, the community-owned and community-driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. The Presto software is released under the Apache License 2.0.

Curious about how you can get going with the Presto platform? Ahana offers a managed service for Presto in the cloud. You can get started for free today with the Ahana Community Edition or a free trial for the full edition. The Ahana Community Edition is a free forever version of the Ahana Cloud managed service.

What is an Open Data Lake in the Cloud?

Have you been hearing the term “Open Data Lakehouse” more often? Learn what the Open Data Lake in the cloud actually is, and how it’s a solution to the massive data problem. Many companies are adopting this architecture because of its better price-performance, scale, and non-proprietary design.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses including their strengths and weaknesses.

Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a completely managed cloud data warehouse service with the ability to scale on-demand. However, the pricing is not simple. Amazon Redshift tries to accommodate different use cases, but the pricing model does not fit all users. Learn more about the pricing of Amazon Redshift.

AWS Redshift Query Limits

At its heart, Redshift is Amazon’s petabyte-scale data warehouse product. Redshift is based on PostgreSQL version 8.0.2. Learn more about the pros and cons of Amazon Redshift.


Amazon S3 Select Limitations

What is Amazon S3 Select?

Amazon S3 Select allows you to use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. 

Why use Amazon S3 Select?

Instead of pulling the entire dataset and then manually extracting the data that you need, you can use S3 Select to filter the data at the source (i.e. S3). This reduces the amount of data that Amazon S3 transfers, which reduces the cost, latency, and data processing time at the client.

What formats are supported for S3 Select?

Currently, Amazon S3 Select only works on objects stored in CSV, JSON, or Apache Parquet format. The stored objects can be compressed with GZIP or BZIP2 (for CSV and JSON objects only). The returned filtered results can be in CSV or JSON, and you can determine how the records in the result are delimited.

How can I use Amazon S3 Select standalone?

You can perform S3 Select SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. 
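
As a minimal sketch of the SDK route, the following Python helper (the bucket, key, and column names in the usage comment are hypothetical) issues an S3 Select query with boto3 and concatenates the filtered result from the response event stream:

```python
def select_csv(s3_client, bucket, key, expression):
    """Run an S3 Select SQL expression over a CSV object and return the raw result bytes."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=expression,
        ExpressionType="SQL",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # The response payload is an event stream; 'Records' events carry
    # chunks of the filtered result, interleaved with Stats/Progress events.
    return b"".join(
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    )

# Usage sketch (requires boto3 and configured AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# rows = select_csv(s3, "my-bucket", "reviews.csv",
#                   "SELECT s.product_id FROM s3object s WHERE s.star_rating = '5'")
```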

What are the limitations of S3 Select?

Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

Additionally, the following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. 

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy.
  • Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 256 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

What is the difference between S3 Select and Presto?

S3 Select is a minimalistic version of pushdown to the source, with limited support for the ANSI SQL dialect. Presto, on the other hand, is a comprehensive ANSI SQL-compliant query engine that can work with various data sources. Here is a quick comparison table.

Comparison             | S3 Select                    | Presto
SQL Dialect            | Fairly limited               | Comprehensive
Data Format Support    | CSV, JSON, Parquet           | Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet
Data Sources           | S3 only                      | Various (over 26 open-source connectors)
Push-Down Capabilities | Limited to supported formats | Varies by format and underlying connector

What is the difference between S3 Select and Athena?

Athena is Amazon’s fully managed service for Presto. As such, the comparison between Athena and S3 Select is the same as outlined above. For a more detailed understanding of the difference between Athena and Presto, see here.

How does S3 Select work with Presto?

S3SelectPushdown can be enabled on your Hive catalog as a configuration option to push down projection (SELECT) and predicate (WHERE) processing to S3 Select. With S3SelectPushdown, Presto retrieves only the required data from S3 instead of entire S3 objects, reducing both latency and network usage.

Should I turn on S3 Select for my workload on Presto? 

S3SelectPushdown is disabled by default, and you should enable it in production only after proper benchmarking and cost analysis. The performance of S3SelectPushdown depends on the amount of data filtered by the query: filtering a large number of rows should result in better performance. If the query doesn’t filter any data, then pushdown may not add any value, and the user will still be charged for S3 Select requests.

We recommend that you benchmark your workloads with and without S3 Select to see if using it may be suitable for your workload. For more information on S3 Select request cost, please see Amazon S3 Cloud Storage Pricing.

Use the following guidelines to determine if S3 Select is a good fit for your workload:

  • Your query filters out more than half of the original data set.
  • Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The TIMESTAMP, REAL, and DOUBLE data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see the Data Types documentation.
  • Your network connection between Amazon S3 and the Presto cluster has good transfer speed and available bandwidth (For the best performance on AWS, your cluster is ideally colocated in the same region and the VPC is configured to use the S3 Gateway endpoint).
  • Amazon S3 Select does not compress HTTP responses, so the response size may increase for compressed input files.

Additional Considerations and Limitations:

  • Only objects stored in CSV format are supported (Parquet is not supported in Presto via the S3 Select configuration). Objects can be uncompressed or optionally compressed with gzip or bzip2.
  • The “AllowQuotedRecordDelimiters” property is not supported. If this property is specified, the query fails.
  • Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption is not supported.
  • S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC and Parquet.

S3 Select makes sense for my workload on Presto, how do I turn it on?

You can enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property or the hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, allowing you to enable or disable it on a per-query basis. You may also need to tune connection properties such as hive.s3select-pushdown.max-connections depending on your workload.
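
As a sketch, enabling pushdown for a Hive catalog might look like this in the catalog’s properties file (the file name and values are illustrative):

```properties
# etc/catalog/hive.properties
hive.s3select-pushdown.enabled=true
hive.s3select-pushdown.max-connections=500
```

To toggle it for a single query instead, set the session property, e.g. SET SESSION hive.s3_select_pushdown_enabled = true;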


What is AWS Lake Formation?

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

We’re seeing more companies move to the data lake because it’s flexible, cheaper, and much easier to use than a data warehouse. You’re not locked into proprietary formats, nor do you have to ingest all of your data into a proprietary technology. As more companies leverage the data lake, security becomes even more important, because you have more people needing access to that data and you want to be able to control who sees what. 

AWS Lake Formation can help address security on the data lake. For Amazon S3 users, it’s a seamless integration that allows you to get granular security policies in place on your data. AWS Lake Formation gives you three key capabilities:

  1. Build data lakes quickly – this means days, not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  2. Simplify security management – You can centrally define and enforce security, governance, and auditing policies.
  3. Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

If you’re currently using AWS S3 or planning to, we recommend looking at AWS Lake Formation as an easy way to get security policies in place on your data lake. As part of your stack, you’ll also need a query engine that will allow you to get analytics on your data lake. The most popular engine to do that is Presto, an open source SQL query engine built for the data lake.

At Ahana, we’ve made it easy to get started with this stack: AWS S3 + Presto + AWS Lake Formation. We provide SaaS for Presto with out of the box integrations with S3 and Lake Formation, so you can get a full data lake analytics stack up and running in a matter of hours.

AWS lake formation diagram

Check out our webinar where we share more about our integration with AWS Lake Formation and how you can actually enforce security policies across your organization.


How does Presto Work With LDAP?

What is LDAP?

To learn how Presto works with LDAP, let’s first cover what LDAP is. The Lightweight Directory Access Protocol (LDAP) is an open, vendor-neutral, industry-standard application protocol used for directory services authentication. With LDAP user authentication, users are validated against an LDAP server before they can communicate with the Presto server. 

Presto & LDAP

Presto can be configured to enable LDAP authentication over HTTPS for clients, such as the Presto CLI, or the JDBC and ODBC drivers. At present only a simple LDAP authentication mechanism involving username and password is supported. The Presto client sends a username and password to the coordinator and the coordinator validates these credentials using an external LDAP service.

To enable LDAP authentication for Presto, the Presto coordinator configuration file needs to be updated with LDAP-related configurations. No changes are required to the worker configuration; only the communication from the clients to the coordinator is authenticated. However, if you want to secure the communication between Presto nodes then you should configure Secure Internal Communication with SSL/TLS.

Summary of Steps to Configure LDAP Authentication with Presto:

Step 1: Gather configuration details about your LDAP server

Presto requires Secure LDAP (LDAPS), so make sure you have TLS enabled on your LDAP server as well.

Step 2: Configure SSL/TLS on the Presto Coordinator

Access to the Presto coordinator must be through HTTPS when using LDAP authentication.

Step 3: Configure Presto Coordinator with config.properties for LDAP

Step 4: Create a Password Authenticator Configuration (etc/password-authenticator.properties) file on the coordinator

Step 5: Configure Client / Presto CLI with either a Java Keystore file or Java Truststore for its TLS configuration.

Step 6: Restart your Presto cluster and invoke the LDAP-enabled CLI with either the --keystore-* or --truststore-* properties (or both) to secure the TLS connection.
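
As a sketch of steps 3 and 4, the coordinator configuration might look like the following (hostnames, paths, and passwords are placeholders):

```properties
# etc/config.properties (coordinator) – enable password (LDAP) auth over HTTPS
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto/keystore.jks
http-server.https.keystore.key=changeit

# etc/password-authenticator.properties (coordinator)
password-authenticator.name=ldap
ldap.url=ldaps://ldap.example.com:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com
```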

Reference: https://prestodb.io/docs/current/security/ldap.html

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our presentation with AWS on how to get started in 30min with Presto in the cloud.


What is Apache Ranger?

What is Apache Ranger? In a Nutshell

Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the data platform. It is an open-source authorization solution that provides access control and audit capabilities for big data platforms through centralized security administration.

Its open data governance model and plugin architecture enabled the extension of access control to projects beyond the Hadoop ecosystem, and the platform is widely accepted among major cloud vendors like AWS, Azure, and GCP. 

With the help of the Apache Ranger console, admins can easily manage centralized, fine-grained access control policies, including file, folder, database, table and column-level policies across all clusters. These policies can be defined at user level, role level or group level.

Apache Service Integration

Apache Ranger uses plugin architecture in order to allow other services to integrate seamlessly with authorization controls.

Apache Ranger plugin diagram

Figure: Simple sequence diagram showing how the Apache Ranger plugin enforces authorization policies with Presto Server.

Ranger also supports centralized auditing of user access and administrative actions for comprehensive visibility of sensitive data usage, through a centralized audit store that tracks all access requests in real time and supports multiple audit stores, including Elasticsearch and Solr.

Many companies today are looking to leverage the Open Data Lake Analytics stack, which is the open and flexible alternative to the data warehouse. In this stack, you have flexibility when it comes to your storage, compute, and security to get SQL on your data lake. With Ahana Cloud, the stack includes AWS S3, Presto, and, in this case, our Apache Ranger integration. 

Ahana Cloud for Presto and Apache Ranger

Ahana-managed Presto clusters can take advantage of the Ranger integration to enforce access control policies defined in Apache Ranger. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our on-demand webinar, hosted with DZone, where we share how you can build an Open Data Lake Analytics stack.

Related Articles

What are the differences between Presto and Apache Drill?

Drill is an open source SQL query engine that began life as an implementation of the ideas in the paper “Dremel: Interactive Analysis of Web-Scale Datasets”. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.

What Is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine that was forked from the original Presto project, whose GitHub repo was called PrestoDB.


Advanced SQL Tutorial

Advanced SQL: JSON

Advanced SQL queries with JSON

Presto has a wide range of JSON functions supporting advanced SQL queries. Consider this JSON test input data (represented in the query using the VALUES clause), which contains three key/value elements. Each key is “name” and each value is a dog breed. If we want to select the first (0th) key/value pair, we would code:

SELECT json_extract(v, '$.dogs') AS all_json, 
  json_extract(v, '$.dogs[0].name') AS name_json, 
  json_extract_scalar(v, '$.dogs[0].name') AS name_scalar 
FROM 
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);
 
                         all_json                         | name_json | name_scalar 
----------------------------------------------------------+-----------+-------------
 [{"name":"Beagle"},{"name":"Collie"},{"name":"Terrier"}] | "Beagle"  | Beagle      
(1 row)

All of Presto’s JSON functions can be found at: https://prestodb.io/docs/current/functions/json.html 

Advanced SQL: Arrays, Un-nesting, and Lambda functions 

Consider the following array of test data elements, and a simple query to multiply each element by 2:

SELECT elements,
    ARRAY(SELECT v * 2
          FROM UNNEST(elements) AS v) AS my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);
 
    elements    | my_result
----------------+---------------------
 [1, 2]         | [2, 4]
 [1, 3, 9]      | [2, 6, 18]
 [1, 4, 16, 64] | [2, 8, 32, 128]
(3 rows)

The above advanced SQL query is an example of nested relational algebra, which provides a fairly elegant and unified way to query and manipulate nested data. 

Now here’s the same query, but written using a lambda expression. Why use lambda expressions? This method makes advanced SQL querying of nested data less complex and the code simpler to read, develop, and debug, especially when the logic gets more complicated:

SELECT elements, 
transform(elements, v -> v * 2) as my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);

Both queries return the same result. The transform function and the “x -> y” notation simply mean “do y to my variable x”.

To see more lambda expression examples check out: https://prestodb.io/docs/current/functions/lambda.html 

Advanced SQL: Counting Distinct Values

Running a count(distinct xxx) function is memory intensive and can be slow to execute on larger data sets. This is true for most databases and query engines. The Presto CLI will even display a warning reminding you of this.

A useful alternative is the approx_distinct function, which uses a different algorithm (HyperLogLog) to estimate the number of distinct values. The result is an approximation, and the margin of error depends on the cardinality of the data. The approx_distinct function should produce a standard error of up to 2.3% (though it could be higher with unusual data). 
Here’s an example comparing count(distinct xxx) and approx_distinct with a table containing 160.7 million rows. Data is stored in S3 as Parquet files and the Presto cluster has 4 workers. We can see approx_distinct is more than twice as fast as count(distinct xxx):

presto:amazon> select count(distinct product_id) from review;
 
  _col0   
----------
 21460962 
(1 row)
 
WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
 
 
Query 20201231_154449_00058_npjtk, FINISHED, 4 nodes
Splits: 775 total, 775 done (100.00%)
0:56 [161M rows, 1.02GB] [2.85M rows/s, 18.4MB/s]
 
presto:amazon> select approx_distinct(product_id) from review;
  _col0   
----------
 21567368 
(1 row)
 
Query 20201231_154622_00059_npjtk, FINISHED, 4 nodes
Splits: 647 total, 647 done (100.00%)
0:23 [161M rows, 1.02GB] [7.01M rows/s, 45.4MB/s]

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.


How to Manage Presto Queries Running Slow

Your Presto queries might be running slow for a few different reasons. Below we’ll share some diagnosis and Presto tuning tips, as well as possible solutions to address the issue and help with performance.

Troubleshoot Presto Queries

  1. How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analyzing, planning and optimizing queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note that Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single-coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger cloud instance, as more CPU and memory could help. You may also run into issues if there are multiple concurrent Presto users.
  2. Have you configured Presto’s memory usage correctly? It is often necessary to change the default memory configuration based on your cluster’s capacity. The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb: in each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.
  3. What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor. Overall throughput will dictate query performance. Consider choosing higher network IO instances for the workers – for example, on AWS you can do this by looking at each instance type’s “network performance” rating.
  4. Optimize your metadata / data catalog: Using Presto’s Hive connector for your metastore, like many users do, will mean practically every query accesses the Hive metastore for table and partition details. During peak times that generates a high load on the metastore, which can slow down query performance. To alleviate this, consider:
    • Setting up multiple catalogs. Configure PrestoDB to use multiple Thrift metastore endpoints – Presto’s Hive connector supports configuring multiple Hive metastore endpoints which are tried in round-robin by the coordinator. https://prestodb.io/docs/current/connector/hive.html 
    • Enabling Hive metastore caching, and carefully tweaking cache eviction configurations and TTLs to suit your data refresh policies.
  5. Do you have a separate coordinator node? When running Presto queries, keep in mind you can have a single node act as both a coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes, but it’s obviously not optimal in terms of performance. It is nearly always recommended to have the coordinator running on a separate node from the workers for anything other than sandbox use. Tip: check your nodes’ Presto etc/config.properties files to determine which one is the coordinator (look for coordinator=true).
  6. Is memory exhausted? If so, this will delay your Presto queries and affect performance. Presto uses an in-memory, pipelining processing architecture, and its operation depends on the available JVM memory, which in turn depends on how much memory Presto is configured to use and how much memory is physically available in the server or instance it is running in.
    • The workers can be memory hungry when processing very large Presto queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary and switch to using a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload and the resources available. It’s not uncommon to see the coordinator alone consuming several tens of GBs of memory. 
    • The good news is there is memory information available in at least two places:
      • Presto’s built-in JMX catalog can help you monitor memory usage with various counters. Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking
      • There is also the Presto Console which reveals, for each query, the reserved, peak and cumulative memory usage.
  7. When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can resolve all sorts of issues, including memory leaks and garbage collection problems. This in turn can increase the speed of your Presto queries.
  8. Is your Presto cluster configured for autoscaling based on CPU usage? If so, check that the configuration is what you expect it to be. This fix can improve the performance of your Presto queries on your data.
  9. Do IO and CPU utilization look balanced? Check CPU usage on Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads can be increased, or that the number of splits in a batch is not high enough.
  10. Have you checked your data volumes recently? An obvious one to check, but data volumes can grow in fits and starts, and sometimes peaks occur unexpectedly. Your Presto queries may simply be taking longer because there is x% more data than last month.
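
As a sketch of the memory guidance above, the settings for a worker with 64 GB of RAM might look like this (the values are illustrative; tune them against your own monitoring):

```properties
# etc/jvm.config – heap set to roughly 80% of the instance's 64 GB
-Xmx51g

# etc/config.properties – per-node query memory limits
query.max-memory-per-node=25GB
query.max-total-memory-per-node=30GB
```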

Other configuration settings cover task concurrency, initial splits per node, join strategy, driver tasks, and more. PrestoDB has around 82 system configuration and 50+ Hive configuration settings which users can tweak, many at the query level. These are, however, for advanced users, and fall outside the scope of this article. Making alterations here, if you are not careful, can affect the speed of your Presto queries. More information can be found in the PrestoDB documentation.

As you can tell, there’s a lot to configure and tune when it comes to addressing Presto performance issues. To make it easier, you can use Ahana Cloud, SaaS for Presto. It’s available in the AWS Marketplace and is pay as you go. Check out our free trial at https://ahana.io/sign-up



Presto functions: mathematical, operators and aggregate


Presto offers several classes of mathematical functions that operate on single values, and mathematical operators that allow for operations on values across columns. In addition, aggregate functions can operate on a set of values to compute a single result.

The mathematical functions are broken into four subcategories: 1. mathematical, 2. statistical, 3. trigonometric, and 4. floating point. The majority fall into the mathematical category, and we’ll discuss them separately. The statistical functions are quite sparse, with two functions that compute the lower and upper bounds of the Wilson score interval of a Bernoulli process. The trigonometric functions are what you’d expect (e.g. sin, cos, tan, etc.). The floating point functions handle not-a-number (NaN) and infinity use cases.

The mathematical functions subcategory further falls into another layer of classification:

  1. Functions that perform coarser approximation, such as rounding and truncation: abs, ceiling (ceil), floor, round, sign, truncate
  2. Conversions: degrees, radians, from_base, to_base
  3. Exponents, logarithms, roots: exp, ln, log2, log10, power (pow), cbrt, sqrt
  4. Convenient constants and generators, such as pi(), e(), and random (rand)
  5. Cumulative distribution functions (and inverses): binomial_cdf, inverse_binomial_cdf, cauchy_cdf, inverse_cauchy_cdf, chi_squared_cdf, inverse_chi_squared_cdf, normal_cdf, inverse_normal_cdf, poisson_cdf, inverse_poisson_cdf, weibull_cdf, inverse_weibull_cdf, beta_cdf, inverse_beta_cdf, width_bucket
  6. Miscellaneous: mod, cosine_similarity

The mathematical operators are basic arithmetic operators, such as addition (+), subtraction (-), multiplication (*), and modulus (%).

Let’s apply these mathematical functions in an example. In the following query, we have a floating-point column x to which we apply several mathematical functions that are representative of the subcategories we discussed previously, including radians (conversion), natural log, the Normal CDF, modulo, a random number, and operators.

select
	x,
	radians(x) as radians_x,               /* convert to radians */
	ln(x) as ln_x,                         /* natural log */
	normal_cdf(0, 30, x) as normal_cdf_x,  /* Normal CDF */
	mod(x, 2) as mod_x_2,                  /* modulo 2 */
	random() as r,                         /* random number */
	3 * ((x / 2) + 2) as formula           /* formula using operators */
from
	example;

The following is the output of the above query, with some rounding for ease of viewing.

Output of query with rounding

So far, we have seen that mathematical functions, as they are classified in Presto, operate on single values. This means that, given a column of values, each function is applied element-wise to that column. Aggregate functions allow us to look across a set of values.

Like mathematical functions, aggregate functions are also broken into subcategories: 1. general, 2. bitwise, 3. map, 4. approximate, 5. statistical, 6. classification metrics, and 7. differential entropy. We will discuss the general and approximate subcategories separately.

The bitwise aggregate functions are two functions that return the bitwise AND and bitwise OR of all input values in 2’s complement representation. The map aggregate functions provide convenient map creation functions from input values. The statistical aggregate functions are standard summary statistic functions you would expect, such as stddev, variance, kurtosis, and skewness. The classification metrics and differential entropy aggregate functions are specialized functions that make it easy to analyze binary classification predictive models and model binary differential entropy, respectively.

The general functions subcategory further falls into another layer of classification:

  1. Common summarizations: count, count_if, min, max, min_by, max_by, sum, avg, geometric_mean, checksum
  2. Boolean tests: bool_or, bool_and, every
  3. Data structure consolidation: array_agg, set_agg, set_union
  4. Miscellaneous: reduce_agg, arbitrary
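To illustrate a couple of the less common general functions, the following sketch (again against our example table) uses min_by to return the x value at the row where abs(x) is smallest, and count_if to count the positive values:

```sql
select
	min_by(x, abs(x)) as x_nearest_zero,  /* x at the row minimizing abs(x) */
	count_if(x > 0) as n_positive         /* number of positive values */
from
	example;
```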

Again, let’s apply these aggregate functions in a series of representative examples. In the following query, we apply a series of basic aggregations to our floating-point column x.

select
	sum(x) as sum_x,
	count(x) as count_x,
	min(x) as min_x,
	max(x) as max_x,
	avg(x) as avg_x,
	checksum(x) as ckh_x
from
	example;

The following is the output of the above query.

SQL query output

In the following query, we showcase a boolean test with the bool_or function. We know that the natural log returns NaN for negative values of x. So, if we apply the is_nan check, we expect is_nan(x) to always be false, but is_nan(ln(x)) to occasionally be true. Consequently, if we apply the bool_or aggregation to these checks, we expect the column derived from x to be false (i.e. no true values at all) and the column derived from ln(x) to be true (i.e. at least one true value). The following query and accompanying result illustrate this.

with nan_test as (
	select
		is_nan(x) as is_nan_x,
		is_nan(ln(x)) as is_nan_ln_x
	from
		example
)
select
	bool_or(is_nan_x) as any_nan_x_true,
	bool_or(is_nan_ln_x) as any_nan_ln_x_true
from
	nan_test;

Presto SQL query output

This final example illustrates data structure consolidation, taking the x and radians(x) columns and creating a single row with a map data structure.

with rad as(select x, radians(x) as rad_x from example)
select map_agg(x, rad_x) from rad;

query output

The approximate aggregate functions provide approximate results for aggregations over large data sets, such as distinct values (approx_distinct), percentiles (approx_percentile), and histograms (numeric_histogram). In fact, we have a short answer post on how to use the approx_percentile function. Several of the approximate aggregate functions rely on other functions and data structures: quantile digest, HyperLogLog and KHyperLogLog.
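For instance, a quick sketch of two of these approximate aggregates applied to our example column:

```sql
select
	approx_distinct(x) as n_unique_approx,      /* approximate count of distinct values */
	approx_percentile(x, 0.5) as median_approx  /* approximate median */
from
	example;
```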

A natural extension to aggregate functions are window functions, which perform calculations across rows of a query result. In fact, all aggregate functions can be used as window functions by adding an OVER clause. One popular application of window functions is time-series analysis. In particular, the lag window function is quite useful. We have a short answer post on how to use the lag window function and to compute differences in dates using the lag window function.
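As a quick sketch, any aggregate such as sum becomes a window function with an OVER clause; here we compute a running sum of x (assuming the same example table):

```sql
select
	x,
	sum(x) over (order by x) as running_sum  /* cumulative sum in ascending order of x */
from
	example;
```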

This short article was a high-level overview, and you are encouraged to review the Presto public documentation for Mathematical Functions and Operations and Aggregate Functions. If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

What is a Presto lag example?

Data Lakehouse

The Presto lag function is a window function that returns the value at an offset before the current row in a window. One common use case for the lag function is time-series analysis, such as autocorrelation.

Figure 1 shows the advert table of sales and advertising expenditure from Makridakis, Wheelwright and Hyndman (1998) Forecasting: methods and applications, John Wiley & Sons: New York. The advert column is the monthly advertising expenditure, and the sales column is the monthly sales volume.


A simple analysis could be to track the difference between the current month’s sales volume and the previous one, which is shown in Figure 2. The lag_1_sales column is a single period lagged value of the sales column, and the diff column is the difference between sales and lag_1_sales. To generate the table in Figure 2, we can use the lag function and the following query:

select
  advert,
  sales,
  lag_1_sales,
  round(sales - lag_1_sales,2) as diff
from (
  select
    advert,
    sales,
    lag(sales, 1) over(range unbounded preceding) as lag_1_sales
  from advert
);

The subquery uses the lag function to get the value of the sales column one period before the current row, where the OVER clause specifies the window. The main query then computes the diff column. Here are a couple of additional useful notes about the lag function:

  1. You can change the offset with the second argument lag(x, OFFSET), where OFFSET is any scalar expression. The default is OFFSET=1 (the previous row); OFFSET=0 returns the current row.
  2. By default, if an offset value is null or falls outside the specified window, a NULL value is returned. We can see this in the first row of the table in Figure 2. However, the default value to use in these cases is configurable with an optional third argument lag(x, OFFSET, DEFAULT_VALUE), where DEFAULT_VALUE is the desired value.

A closely related function is the lead function, which returns the value at an offset after the current row.
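To sketch both points, the following hypothetical query mirrors the earlier one, supplying a default value of 0.0 for lag and adding the lead function:

```sql
select
  advert,
  sales,
  lag(sales, 1, 0.0) over(range unbounded preceding) as lag_1_sales,  /* 0.0 instead of NULL on the first row */
  lead(sales, 1) over(range unbounded preceding) as lead_1_sales      /* next row's sales; NULL on the last row */
from advert;
```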

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

How do I get the date_diff from previous rows?

To find the difference in time between consecutive dates in a result set, Presto offers window functions. Take the example table below which contains sample data of users who watched movies.

Example:

select * from movies.ratings_csv limit 10;


select userid, date_diff('day', timestamp, lag(timestamp) over (partition by userid order by  timestamp desc)) as timediff from ratings_csv order by userid desc limit 10;


The lag(x, offset) function fetches the value of column x at offset rows before the current row within the window, and date_diff then computes the difference between the two timestamps. When no offset is provided, the default is 1 (the previous row). Notice that the first row in timediff is NULL because there is no previous row.

How do I use the approx_percentile function in Presto?

percentile

The Presto approx_percentile function is one of the approximate aggregate functions; it returns an approximate percentile for a set of values (e.g. a column). In this short article, we will explain how to use the approx_percentile function.

What is a percentile?

percentile function

From Wikipedia:

In statistics, a percentile (or a centile) is a score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition)

To apply this, we’ll walk through an example with data points from a known, and arguably most famous, distribution: the Normal (or Gaussian) distribution. The adjacent diagram plots the density of a Normal distribution with a mean of 100 and a standard deviation of 10.

If we were to sample data points from this Normal distribution, we know that approximately half of the data points would be less than the mean and half would be above it. Hence, the mean, or 100 in this case, would be the 50th percentile for the data. It turns out that the 90th percentile would be approximately 112.82; this means that 90% of the data points are less than 112.82.

approx_percentile by example

To solidify our understanding of this and the approx_percentile function, we’ve created a few tables to use as examples:

presto:default> show tables;
    Table
-------------
 dummy
 norm_0_1
 norm_100_10
 norm_all
(4 rows)
 Table        Description                                                                  Rows
 -----------  ---------------------------------------------------------------------------  -----
 dummy        Single-column, 100-row table of all ones except for a single value of 100.    100
 norm_0_1     Samples from a normal distribution with mean 0 and standard deviation 1.      5000
 norm_100_10  Samples from a normal distribution with mean 100 and standard deviation 10.   5000
 norm_all     Coalescence of all normal distribution tables.                                10000

Table 1

The approx_percentile function has eight type signatures. You are encouraged to review the Presto public documentation for all the function variants and official descriptions. The set of values (e.g. column) is a required parameter and is always the first argument.

Another required parameter is the percentage parameter, which indicates the percentage or percentages for the returned approximate percentile. The percentage(s) must be specified as a number between zero and one. The percentage parameter can either be the second or third argument of the function, depending on the intended signature. In the following examples, the percentage parameter will be the second argument. For example, approx_percentile(x,0.5) will return the approximate percentile for column x at 50%. For data points in our norm_100_10 table, we expect the returned value to be around 100.

presto:default> select approx_percentile(x,0.5) from norm_100_10;
      _col0
------------------
 99.8184647799587
(1 row)

approx_percentile(x,0.9) will return the approximate percentile for column x at 90%, which for the data in norm_100_10 table should be around 112.82.

presto:default> select approx_percentile(x,0.9) from norm_100_10;
      _col0
------------------
 112.692881777202
(1 row)

In a single query, you can also specify an array of percentages to compute percentiles: approx_percentile(x,ARRAY[0.5, 0.9]).

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_100_10;
                _col0
--------------------------------------
 [99.8184647799587, 112.692881777202]
(1 row)

We can ask for multiple percentages for our dummy table, which consists of 100 rows of all ones except for a single value of 100. Hence, we expect all percentiles below 99% to be 1.

presto:default> select approx_percentile(x,ARRAY[0.1, 0.5, 0.98, 0.99]) from dummy;
         _col0
------------------------
 [1.0, 1.0, 1.0, 100.0]
(1 row)

We can also use a GROUP BY clause to segment the values to compute percentiles over. To illustrate this, we will use our norm_all table, which contains values from both the norm_100_10 and the norm_0_1 tables. The m and sd columns specify the mean and standard deviation of the normal distribution the corresponding x value is sampled from.

presto:default> select m, sd, x from norm_all order by rand() limit 10;
  m  | sd |         x
-----+----+--------------------
 0   | 1  | -0.540796486700647
 0   | 1  |   0.81148151337731
 0   | 1  |   1.28976310661005
 100 | 10 |   97.0272872801269
 100 | 10 |   83.1392343835652
 0   | 1  | -0.585678877703149
 0   | 1  |  0.268589447255106
 0   | 1  | -0.280908719376113
 100 | 10 |    104.36328077332
 0   | 1  |  0.266294347905949
(10 rows)

The following query then returns the approximate percentiles at 50% and 90% for data points grouped by the same values of m and sd (i.e. from the same normal distribution): select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd)). As expected, we see that the approximate 50% and 90% percentiles are around 100 and 112.82 for a mean of 100 and standard deviation of 10, and around 0 and 1.28 for a mean of 0 and standard deviation of 1.

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd));
                  _col0
------------------------------------------
 [99.8563616481321, 112.879972343696]
 [-0.00458419083839064, 1.30949677294588]
(2 rows)

An optional parameter is accuracy, which controls the maximum rank error and defaults to 0.01. The value of accuracy must be between zero and one (exclusive) and must be constant for all input rows. We can add accuracy as the third argument to our function.

For example, approx_percentile(x,0.9,0.5) will return the approximate percentile for column x at 90% with a 0.5 maximum rank error. By allowing for a larger error (compared to the 0.01 default), we can see the approximate value of 113.50 is farther from our true value of 112.82 than our previous result of 112.69.

presto:default> select approx_percentile(x,0.9,0.5) from norm_100_10;
       _col0
--------------------
 113.49999999999999
(1 row)

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Can I write back or update data in my Hadoop / Apache Hive cluster through Presto?

Using Presto with a Hadoop cluster for SQL analytics is pretty common, especially in on-premise deployments.


With Presto, you can read and query data from the Hadoop datanodes, and you can also make changes to data in Hadoop HDFS. There are, however, some restrictions.

All this is enabled via Presto’s Hive Connector. 

The first step is to create a catalog properties file and point to the Hive Metastore. 

You can also optionally configure some Hive Metastore properties for the Hive Connector. 

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://example.net:9083

For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. 

Creating a new table in HDFS via Hive

Using Presto you can create new tables via the Hive Metastore. 

Examples

Create table: 

Create a new Hive table named page_views in the web schema, stored using the ORC file format and partitioned by date and country. hive is the name of the catalog, which corresponds to the name of the properties file.

CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country']
)

Deleting data from Hive / Hadoop

With the Hive connector, you can delete data but this has to be at the granularity of entire partitions. 

Example: 

Drop a partition from the page_views table:

DELETE FROM hive.web.page_views
WHERE ds = DATE '2016-08-09'
  AND country = 'US'

Drop the external table request_logs. This only drops the metadata for the table. The referenced data directory is not deleted:

DROP TABLE hive.web.request_logs

Drop a schema:

DROP SCHEMA hive.web

Hive Connector Limitations

  • DELETE is only supported if the WHERE clause matches entire partitions.
  • UPDATE is not supported in Presto’s Hive connector.
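For example, a row-level delete like the following sketch (the column and value are hypothetical) would be rejected by the Hive connector, since the predicate does not match entire partitions:

```sql
-- Fails: the WHERE clause filters individual rows, not whole partitions
DELETE FROM hive.web.page_views
WHERE user_id = 12345
```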

How do I convert Unix Epoch time to a date or something more human readable with SQL?

Unix Epoch time often gets stored in the database, but it is not very human readable; conversion is required for reports and dashboards.

Example of Unix Epoch Time: 

1529853245

Presto provides many date time functions to help with conversion. 

In case of a Unix Epoch Time, the from_unixtime function can be used to convert the Epoch time. 

This function returns a timestamp. 

from_unixtime(unixtime) → timestamp
Returns the UNIX timestamp unixtime as a timestamp.


After converting the Unix Epoch time to a timestamp, you can cast it into other formats as needed such as extracting just the date. Examples follow below. 


Examples: 

Query

select from_unixtime(1529853245) as timestamp;

Result

timestamp

2018-06-24 15:14:05.000

Query

select cast(from_unixtime(1529853245) as date) as date;

Result

date

2018-06-24

More examples and information can be found here: https://ahana.io/answers/how-to-convert-date-string-to-date-format-in-presto/
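Beyond casting to a date, you can also format the timestamp into an arbitrary string. For example, a sketch using Presto’s format_datetime function with a Joda-style pattern:

```sql
select format_datetime(from_unixtime(1529853245), 'yyyy-MM-dd HH:mm') as formatted;
```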

How do I transfer data from a Hadoop / Hive cluster to a Presto cluster?

Hadoop is a system that manages both compute and data together. Hadoop cluster nodes have the HDFS file system and may also have different types of engines like Apache Hive, Impala or others running on the same or different nodes. 


In comparison, Presto, an open source SQL engine built for data lakes, is only a query engine. This means that it does not manage its own data. It can query data sitting in other places like HDFS or in cloud data lakes like AWS S3. 

Because of this, there is no data transfer or ingestion required into Presto for data that is already residing in an HDFS cluster. Presto’s Hive Connector was specifically designed to access data in HDFS and query it in Presto. The Hive connector needs to be configured with the right set of config properties.


The Presto Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://example.net:9083

For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. In some cases, such as when using federated HDFS or NameNode high availability, it is necessary to specify additional HDFS client options in order to access your HDFS cluster. To do so, add the hive.config.resources property to reference your HDFS config files:

hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

Presto substring operations: How do I get the X characters from a string of a known length?

Presto provides an overloaded substring function to extract characters from a string. We will use the string “Presto String Operations” to demonstrate the use of this function.

Extract last 7 characters:

presto> SELECT substring('Presto String Operations',-7) as result;

 result

---------

 rations

(1 row)

Query 20210706_225327_00014_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract last 10 characters:

presto> SELECT substring('Presto String Operations',-10) as result;

   result

------------

 Operations

(1 row)

Query 20210706_225431_00015_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the middle portion of the string:

presto> SELECT substring('Presto String Operations',8,6) as result;

 result

--------

 String

(1 row)

Query 20210706_225649_00020_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the beginning portion of the string:

presto> SELECT substring('Presto String Operations',1,6) as result;

 result

--------

 Presto

(1 row)

Query 20210706_225949_00021_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Data Lakehouse

What is Spark SQL?

Spark is a general purpose computation engine for large-scale data processing. At Spark’s inception, the primary abstraction was a resilient distributed dataset (RDD), an immutable distributed collection of data. Since then, higher level abstractions—called DataFrames and Datasets—that more closely resemble classic database tables have been introduced to work with structured data. Spark SQL is the Spark module for working with these abstractions and structured data.

In addition to DataFrames and Datasets, Spark SQL also exposes SQL to interact with data stores and DataFrames/Datasets. For example, let’s say we wanted to return all records from a table called people with the basic SQL query: SELECT * FROM people. To do so with Spark SQL, we could programmatically express this in Python as follows:

people_dataframe = spark.sql("SELECT * FROM people")

spark is a SparkSession instance, the main Spark entry point for structured data abstractions, and the statement returns a Spark DataFrame. With the SQL API, you can express SQL queries and get back DataFrames, and from DataFrames you can create tables on which you can execute SQL queries. Because the SQL language is widely known, it allows a broader range of data practitioner personas, such as SQL analysts, to perform data processing on top of Spark.

>> Learn more about using Spark’s execution engine with Presto.

Since Spark 3.0, Spark SQL has introduced experimental options to be strictly ANSI compliant instead of Hive compliant. Prior to that, Spark SQL supported both ANSI SQL and HiveQL. Please consult the official Apache Spark SQL Reference if you are interested in the specifics of supported syntax, semantics, and keywords.

Regardless of whether you express data processing directly with DataFrame/Dataset methods or SQL, Spark SQL runs the same execution engine under the hood.  Further, through Spark SQL, the structured nature of the data and processing provide additional context to Spark about the data itself—such as the column types—and the workload.  This additional context allows for additional optimization, often resulting in better performance.

While Spark SQL is a general-purpose engine, you might want to consider Presto if your target use cases are predominantly interactive, low-latency queries on structured data. We compare Spark SQL and Presto in this short article.


How do I query a data lake with Presto?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Structured and semi-structured data can be queried by Presto, an open source SQL engine. This allows users to store data as-is, without having to first structure the data, and run different types of analytics. 

To query this data in data lakes, the following technologies are needed. 

  1. A SQL query engine – Presto was built for querying data lakes like HDFS and now increasingly AWS S3 and Google Cloud Platform – Google Cloud Storage and others. 
  2. A big data catalog – there are two popular big data catalog systems also called metastores – the Hive Metastore and AWS Glue service. 
  3. Buckets in the data lake like AWS S3 

What types of data can be queried by Presto? 

The following file formats can be queried by Presto:

  1. ORC
  2. Parquet
  3. Avro
  4. RCFile
  5. SequenceFile
  6. JSON
  7. Text
  8. CSV

How does it work? 

First, data in the data lake needs to be mapped into tables and columns. This is what the Hive Metastore and AWS Glue catalogs help with. For example, if there is a CSV file, then once Presto is integrated with the Hive Metastore or Glue, users can run Presto commands to create a schema and then use a CREATE TABLE statement to map the file to a table and columns.

Presto SQL

Example:

SHOW CATALOGS; 
USE ahana_hive.default; 

CREATE TABLE user ( 
	registration_dttm 	timestamp, 
	id 					int,
	first_name 			varchar,
	last_name 			varchar,
	email 				varchar,
	gender 				varchar,
	ip_address 			varchar,
	cc 					varchar,
	country 			varchar,
	birthdate 			varchar,
	salary 				double,
	title 				varchar,
	comments 			varchar
) WITH ( 
format = 'CSV', 
skip_header_line_count = 1,
external_location = 's3a://ahana/userdata/' );

Once the table is created and mapped to the external location, it can immediately be queried. 

Example: 

SELECT * FROM user;

You can run these commands in Presto using the Presto-cli. More information in the docs.

If you are looking for better performance, it is recommended to convert formats like JSON, CSV into more optimized formats like Apache Parquet and Apache ORC. This will improve query performance greatly. This can also be done with Presto using the CREATE TABLE AS command. More information on this here
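As a sketch (the target table name here is hypothetical), converting the CSV-backed user table into Parquet with CREATE TABLE AS could look like:

```sql
CREATE TABLE user_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM user;
```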

Ahana Cloud makes it very easy to query a data lake with Presto. It is a managed service for Presto and also comes with a built-in Hive Metastore so that you don’t need to deploy and manage one. In addition, it can also integrate with AWS Glue. Getting started with Ahana Cloud is easy. Here’s how: https://ahana.io/docs/getting-started

Additional resources: 

Presto Docker Container 
Presto-cli Docker Container 
Presto Sandbox Docker Container 

AWS EMR Presto

Advice: Want to test out Presto outside of AWS and EMR? Learn how you can run Presto as the SQL engine to your Open Data Lakehouse with the forever free offering of Ahana Community Edition.

Why am I getting a Presto EMR S3 timeout error?

If you are using AWS EMR Presto and want to understand this error better, there are a few places to start your investigation. First, you can use the S3 Select pushdown feature to push compute operations (i.e. SELECT) and predicate operations (i.e. WHERE) down to S3. Pushdown makes query performance noticeably faster, because queries retrieve only the required data from S3, so there is essentially less to slow down the SQL query. It also reduces the amount of data transferred between EMR Presto and S3.

AWS EMR Presto

Now, if you are using pushdown for EMR Presto and seeing a timeout error, there might be a few reasons for that. Because Presto uses EMRFS as its file system, there’s a maximum allowable number of client connections to S3 through EMRFS for Presto (500). When using S3 Select Pushdown, you bypass EMRFS when you access S3 for predicate operations, so the value of hive.s3select-pushdown.max-connections determines the maximum number of client connections allowed by worker nodes. Requests that aren’t pushed down use the value of fs.s3.maxConnections.

At this point you might get an error that says “timeout waiting for connection from pool”. In that case, you need to increase both of the values above, which should resolve the problem.
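As an illustrative (not prescriptive) sketch, the two properties live in different places: the Presto catalog properties for pushed-down requests, and the EMRFS configuration for requests that are not pushed down. The values below are hypothetical; tune them for your workload:

```
# Presto Hive catalog properties (e.g. etc/catalog/hive.properties)
hive.s3select-pushdown.max-connections=1000

# EMRFS configuration (e.g. emrfs-site classification)
fs.s3.maxConnections=1000
```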

Errors like these are common with Presto on EMR, as reported by EMR users. EMR is a complex and resource-intensive option, so there is a lot to understand and prepare for when it comes to Hadoop-specific configuration and tuning parameters. Because of this complexity, many companies have switched from EMR Presto to Ahana Cloud, a managed service for Presto on AWS that is much easier to get up and running. Ahana Cloud is a non-Hadoop deployment in a fully managed environment. Users see up to 23x performance with Ahana’s built-in caching and have reported a 5.5x increase in price performance, achieving the same or better performance at a fraction of the cost.

Additional EMR Presto Resources

Check out some of the differences between Presto EMR and Ahana Cloud. If you’re using EMR Presto today, Ahana Cloud might help with some of those pain points. Additionally, Ahana is pay-as-you-go pricing and it’s easy to get started if you’re already an EMR user. 


Related Articles

What is Presto?

Want to take a deeper dive into Presto? Learn what PrestoDB is, how it started, and the benefits users report.

How to Build a Data Lake Using Lake Formation on AWS

AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach.

5 Components of Data Warehouse Architecture

In this article we’ll look at the contextual requirements of a data warehouse, which are the five components of a data warehouse. We’ll break down the pros and cons and compare it to a data lake.

Running Presto Queries – Do I need to move data first?

Presto Queries

Do I need to move data when running Presto queries?

No, when you are running Presto queries your data stays in-place for the SQL query. Because of this, you will not need to move your data before executing Presto queries. If you’re using AWS S3 for your data lake, for example, you wouldn’t need to ingest it to query as you would if you were using a data warehouse like AWS Redshift. 

To bring Presto compute to your data, you can leverage Ahana Cloud.

With Ahana Cloud, it’s very easy to leverage the power of Presto to query AWS S3. You just connect your data source to Ahana and everything continues to run in your cloud account (called in-vpc). It’s just a click of a button to add data sources to Ahana Cloud for querying. Ahana Cloud would be a replacement for Amazon Athena, EMR Presto, or in the event that you’re running Presto on your own in AWS. It’s a managed service for Presto that takes care of all the configuration, tuning, deployment, managing, attaching/detaching data sources, etc.
You can learn more about Ahana Cloud, and you can also sign up for a trial.


Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understand of what it is and how it’s used.

Data Warehouse Concepts for Beginners

A relational database that is designed for query and analysis rather than for transaction processing. Learn more here.

How do I sync my partition and metastore in Presto?

Sync partition metadata is used to sync the metastore with partition information on the file system/S3 for an external table. Depending on the number of partitions, the sync can take time.

Here is a quick reference from the Presto docs: https://prestodb.io/docs/current/connector/hive.html?highlight=sync_partition_metadata

Procedures:

  • system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)
    Create an empty partition in the specified table.
  • system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)
    Check and update partitions list in metastore. There are three modes available:
    • ADD : add any partitions that exist on the file system but not in the metastore.
    • DROP: drop any partitions that exist in the metastore but not on the file system.
    • FULL: perform both ADD and DROP.

The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (e.g. col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
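As a usage sketch (the schema and table names here are hypothetical), the procedure is invoked through the hive catalog with CALL:

```sql
-- Add partitions present on the file system/S3 but missing from the metastore
CALL hive.system.sync_partition_metadata('web', 'page_views', 'ADD');
```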

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.


How do I run a CTAS (Create Table As) with a Query?

There are a few different ways to run a CTAS with a Query in Presto. Below we’ll lay those out.

OPTION 1: Managed Schema
Managed Schema Creation:
CREATE SCHEMA managed_schema WITH (location = 's3a://mybucket/managed_schema/');

Create Table AS with Managed Schema:

CREATE TABLE managed_movies_orc
COMMENT 'Managed Movies ORC'
WITH (format = 'ORC')
AS
SELECT * FROM "ahana-demo".ahana_s3_movies;

OPTION 2: External Table 
Technically, you can’t do a CREATE TABLE AS with an external table; you will get the error:

presto error: External tables cannot be created using CREATE TABLE AS

You can, however, do a CREATE TABLE followed by INSERT INTO once hive.non-managed-table-writes-enabled is set to true:
https://prestodb.io/docs/current/connector/hive.html

CREATE TABLE <your schema>.ext_table_movies_orc (movieid bigint, title varchar, genres varchar)
WITH (
  format = 'ORC',
  external_location = 's3a://mybucket/externa_table/'
)

INSERT INTO <your schema>.ext_table_movies_orc SELECT * FROM "ahana-demo".ahana_s3_movies

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Presto SQL Engine

What is the difference between a managed table and external tables?

Advice: Looking for more information about understanding PrestoDB and its components? Grab a free download of the O’Reilly ebook, Learning and Operating Presto.

Managed Table & External Tables: Main Difference

There are a few differences between these. However, the main difference between a managed and external table is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to independently manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped. See the additional differences between the two and learn when to use each below.

Managed Table & External Table: Within Presto

Within Presto, the two table types are created as follows:

External table: created using WITH (external_location = '...')
Managed table: created in a schema that was itself created using WITH (location = '...')

You cannot “insert into” an external table (by default, the setting hive.non-managed-table-writes-enabled=false prevents you from doing so).

The expectation is that the data in an external table is managed externally, e.g. by Spark, Hadoop, Python scripts, or another external ETL process.
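As a sketch of the two forms side by side (bucket paths, schema, and table names are hypothetical), using Presto's Hive connector:

```sql
-- External table: data location is given explicitly; DROP TABLE leaves the S3 files intact
CREATE TABLE hive.my_schema.events_external (
  event_id bigint,
  payload varchar
)
WITH (format = 'ORC', external_location = 's3a://mybucket/events/');

-- Managed table: lives under the schema's location; DROP TABLE also deletes the files
CREATE SCHEMA hive.managed_schema WITH (location = 's3a://mybucket/managed/');
CREATE TABLE hive.managed_schema.events_managed (
  event_id bigint,
  payload varchar
)
WITH (format = 'ORC');
```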

Internal Tables vs. External Tables in Apache Hive

Below are the major differences between Internal vs External tables in Apache Hive:

Internal (managed) table:
  • Created by default by Hive.
  • Hive owns the metadata and the table data, managing the full lifecycle of the table.
  • Dropping the table drops the metadata from the Hive Metastore and the files from HDFS.
  • Metadata for inserts, creation of new partitions, etc. is updated automatically through the metastore.

External table:
  • Created using the EXTERNAL option/clause.
  • Hive manages the table metadata but not the underlying files.
  • Dropping the table drops just the metadata from the Metastore without touching the actual files on HDFS/S3.
  • You need to explicitly run system.sync_partition_metadata to sync changes on S3 with the metastore.

Conclusion

There tends to be confusion about when to pick managed tables and when to use external tables. You can rely on the following rule of thumb.

In short, use managed tables when the metastore should manage the lifecycle of the table, or when you are generating temporary tables. Use external tables when the files are already present or in remote locations, and the files should remain even if the table is dropped.

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management, and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace. Give it a try and you’ll be up and running Presto in under an hour.

Related Articles

What is Presto?

Want to take a deeper dive into understanding Presto? Learn what PrestoDB is, how it got started, and the benefits for Presto users.

How to Build a Data Lake Using Lake Formation on AWS

AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time: days instead of the months common with a traditional data lake approach. Learn more about AWS Lake Formation, including its pros and cons.

Data Warehouse: A Comprehensive Guide

A data warehouse is a data repository, typically used for analytic systems and Business Intelligence tools. Take a look at this article to get a better understanding of what it is, how it’s used, and the pros and cons of a data warehouse compared to a data lake.

Data Lakehouse

What are the Presto FAQs?

What is Presto?

To start addressing the Presto FAQs: Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratchpad for a long calculus problem.
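The staged query plan described above can be inspected with EXPLAIN. A hedged sketch (the table and column names are hypothetical):

```sql
-- Show the distributed plan, fragment by fragment (stage by stage)
EXPLAIN (TYPE DISTRIBUTED)
SELECT c.region, count(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```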

Presto FAQ 1
Why is Presto so fast?

In the past, distributed query engines like Hive were designed to persist intermediate results to disk. As the below figure illustrates, Presto saves time by executing the queries in the memory of the worker machines, including performing operations on intermediate datasets there, instead of persisting them to disk. The data can reside in HDFS or any database or any data lake, and Presto performs the executions in-memory across your workers, shuffling data between workers as needed. Avoiding the need for writing and reading from disk between stages ultimately speeds up the query execution time. Hive intermediate data sets are persisted to disk. Presto executes tasks in-memory.

If this distributed in-memory model sounds familiar, that’s because Apache Spark uses the same basic concept to effectively replace MapReduce-based technologies. However, Spark and Presto manage stages differently. In Spark, data needs to be fully processed before passing to the next stage. Presto uses a pipeline processing approach and doesn’t need to wait for an entire stage to finish.
Presto was developed with the following design considerations:
• High performance with in-memory execution
• High scalability from 1 to 1000s of workers
• Flexibility to support a wide range of SQL use cases
• Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
• Federation of data sources via Presto connectors
• Seamless integration with existing SQL systems by adhering to the ANSI SQL standard

Portions of this answer are excerpted from the O’Reilly book “Learning and Operating Presto”; get your free digital copy from Ahana: https://ahana.io/ebook/

[Figure: Hive persists intermediate data sets to disk, while Presto executes tasks in-memory]

Presto FAQ 2
What is Presto used for?

Presto is used as the distributed query engine of an analytics platform. The Presto engine handles many different types of analytical workloads to address a variety of problems. Leveraging a separate storage and compute model, also known as “disaggregated,” Presto can query one or more data stores and metastores and serve a range of SQL tools.

Presto FAQ 3
What are common Presto use cases?

1. Ad hoc querying
Presto was originally designed for interactive analytics, or ad hoc querying. That means results are returned in a matter of seconds, so an analyst can interactively drill down and ask questions of large amounts of data. In today’s competitive “Internet era,” with data systems able to collect granular amounts of data in near real time, engineers, analysts, data scientists, and product managers all want the ability to quickly analyze their data and become data-driven organizations that make superior decisions, innovating quickly and improving their businesses.

They either type simple queries in by hand or use a range of visualization, dashboarding, and BI tools. Depending on the tools chosen, they can run tens of complex concurrent queries against a Presto cluster. With Presto connectors and their in-place execution, platform teams can quickly provide access to the data sets that users want. Not only do analysts get access, but with the power of Presto they can run queries in seconds or minutes instead of hours, and they can iterate quickly on innovative hypotheses with interactive exploration of any data set, residing anywhere.

2. Reporting and dashboarding
Because of the design and architecture of Presto and its ability to query across multiple sources, Presto is a great backend for reporting and dashboarding. Unlike the first-generation static reporting and dashboarding, today’s interactive reporting and dashboards are very different. Analysts, data scientists, product managers, marketers and other users not only want to look at KPI’s, product statistics, telemetry data and other data, but they also want to drill down into specific areas of interest or areas where opportunity may lie. This requires the backend – the underlying system – to be able to process data fast wherever it may sit. To support this type of self-service analytics, platform teams are required to either consolidate data into one system via expensive pipelining approaches or test and support every reporting tool with every database, data lake and data system their end users want to access. Presto gives data scientists, analysts and other users the ability to query data across sources on their own so they’re not dependent on data platform engineers. It also greatly simplifies the task of the data platform engineers by absorbing the integration testing and allowing them to have a single abstraction and end point for a range of reporting and dashboarding tools.

3. ETL using SQL
Analysts can aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto. Instead of legacy batch processing systems, Presto can be used to run resource-efficient and high throughput queries. ETL can process all the data in the warehouse; it generates tables that are used for interactive analysis or feeding various downstream products and systems.

Presto as an engine is not an end-to-end ETL system, nor is Hive or Spark. Additional tools can easily be added to coordinate and manage numerous ongoing time-based jobs, a.k.a. cron jobs, which take data from one system and move it into another store, usually in a columnar format. Users can use a workflow management system like open source Apache Airflow or Azkaban. These automate tasks that would normally have to be run manually by a data engineer. Airflow is an open source project that programmatically authors, schedules, and monitors ETL workflows; it was built by Airbnb employees who were former Facebook employees. Azkaban, another open source project, is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs.

The queries in batch ETL jobs are much more expensive in terms of data volume and CPU than interactive jobs, so the clusters tend to be much bigger. Some companies therefore run separate Presto clusters: one for ETL and another for ad hoc queries. This is operationally advantageous since it is the same Presto technology and requires the same skills. For the ETL cluster, it’s much more important that the throughput of the entire system is good, versus latency for an individual query.
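A hedged sketch of what such a scheduled ETL query might look like (the catalog, schema, and table names are hypothetical; a scheduler like Airflow would run this daily):

```sql
-- Aggregate yesterday's raw events into a daily summary table
-- consumed by dashboards and downstream systems
INSERT INTO warehouse.daily_user_activity
SELECT
  date_trunc('day', event_time) AS activity_date,
  user_id,
  count(*) AS event_count
FROM raw_logs.user_events
WHERE event_time >= date_add('day', -1, current_date)
GROUP BY 1, 2;
```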

4. Data Lake Analytics
Data lakes have grown in popularity along with the rise of S3-compatible object storage, which AWS popularized with Amazon S3. A data lake enables you to store all your structured and unstructured data as-is and run different types of analytics on it.

A data warehouse is optimized to analyze relational data, typically from transactional systems and business applications, where the data is cleaned, enriched, and transformed. A data lake is different from a data warehouse in that it can store all your data–the relational data as well as non-relational data from a variety of sources, such as from mobile apps, social media, time-series data–and you can derive more new insights from the analysis of that broader data set. Again you can do so without necessarily needing to process and copy that data beforehand.

Presto is used to query data directly on a data lake without the need for additional ingestion or copies. You can query any type of data in your data lake, including both structured and unstructured data. As companies become more data-driven and need to make faster, more informed decisions, the need for analytics on an increasingly larger amount of data has become a higher priority in order to do business.

5. Real-time analytics with real-time databases
Real-time analytics is increasingly used in conjunction with consumer-facing websites and services. This usually involves combining data that is being captured in real time with historical or archived data. Imagine if an e-commerce site had a history of your activity archived in an object store like S3, but your current session activity is getting written to a real-time database like Apache Pinot. Your current session activity may not make it into S3 for hours, until the next snapshot. By using Presto to unite data across both systems, that website could provide you with real-time incentives so you don’t abandon your cart, or it could detect possible fraud earlier and with greater accuracy.
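A hedged sketch of such a federated query (the catalog, schema, and table names are hypothetical; here the hive catalog points at the S3 archive and the pinot catalog at the real-time database):

```sql
-- Combine archived session events in S3 with live session events in Pinot
SELECT user_id, count(*) AS page_views
FROM (
  SELECT user_id FROM hive.archive.session_events
  UNION ALL
  SELECT user_id FROM pinot.default.live_session_events
) AS t
GROUP BY user_id;
```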

Portions of this article are excerpted from the O’Reilly book “Learning and Operating Presto”; get your free digital copy from Ahana: https://ahana.io/ebook/. It includes a chapter on Presto with Pinot, a real-time database.

Presto FAQ 4
What is Apache Presto?

Apache Presto is a misnomer: Presto is not a project hosted under the Apache Software Foundation (ASF). Incubator and top-level ASF projects are subject to the naming convention “Apache [Project Name]”; one example is Apache Cassandra. Instead, the open source Presto project is hosted by another well-known body, The Linux Foundation, which creates sub-foundations; Presto sits specifically under the Presto Foundation. This is similar to how Kubernetes is a project in the Cloud Native Computing Foundation, a sub-foundation of The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which Presto is licensed under.

Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratchpad for a long calculus problem.

What are the AWS Glue partition limits and does it apply to AWS Athena?

Typically you’ll use AWS Glue to create the data sources and tables that Athena will query. You can also create your tables using Athena and then use them in AWS Glue. Today the AWS Glue limit is 10M partitions, while Athena’s limit is 20K partitions per table, so using the AWS Glue catalog with your Athena deployment gives you far more partitions.

When it comes to using Amazon Athena, there are a lot of other limits besides partitions including query, concurrent query, and database limits. Many times AWS Athena users find these limits to hinder SLAs and other business-critical applications. As a result, they’re looking for alternatives to AWS Athena and have found that Ahana Cloud’s Presto managed service can address those issues.

Ahana Cloud is a Presto managed service that gives users the power of Presto, similar to AWS Athena, but without the limitations around queries, performance, and control. With Ahana you get the full control of your Presto deployment on AWS, so you can scale as you need without having to worry about partition or query limits.

Learn more about how Ahana Cloud compares to AWS Athena.

Ready to get started? Sign up to try out Ahana Cloud for free. If you’re an AWS Athena user, it’s really easy to move from Athena to Ahana Cloud, and it requires no change to your existing architecture.

You can also check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

What level of concurrency performance can I expect using Presto as part of the AWS Athena service?

I’m getting a lot of my workloads queued up when I use AWS Athena. What are my other options?

Background:

Concurrency is an important factor when it comes to submitting a lot of query workloads. Many data platform teams depend on being able to handle multiple requests from multiple users at the same time. In AWS Athena you can submit queries, but they won’t always be running; the number of concurrently running queries determines your overall performance. As you may know, AWS Athena is serverless, so you don’t see much behind the curtain. As such, depending on the load on the shared service (shared with other customers), you’ll see different operating characteristics, leading to non-deterministic performance.

When your number of running queries hits a limit, any additional queries are put into a queue to wait for running queries to finish. We’ve heard that queuing can occur with as few as 3 to 5 concurrently running queries. This can make it difficult to provide consistent levels of performance to your users.


But I thought you could increase the service limits to address the queuing behavior?

Yes and no. AWS Athena has soft limits on the number of active queries. But it turns out that raising those limits doesn’t change the number of concurrently running queries; instead it seems to allow you to submit more requests without getting the “TooManyRequest” exception, and there may be more queries in a RUNNING/0KB state, meaning they’re queued.

Other alternatives

If you need consistent performance across a large number of queries, another approach would be to run your own Presto service. AWS Athena is a serverless version of Presto but, as mentioned, has limits on performance. In addition, Athena’s pay-per-TB-scanned model can be cost-prohibitive. There are two options for running your own query service:

  1. Operating your own Presto clusters with AWS EMR or by yourself using AMIs
  2. Using a managed service like Ahana Cloud for Presto

On #2, Ahana Cloud for Presto provides you with a scalable amount of concurrent query performance, but without the operations burden of managing your own clusters. Because Presto is a distributed SQL query engine, adding more Presto instances increases the number of queries that can run at the same time.


To compare the performance of a set of query workloads, it’s always best to look at the overall throughput, which means including the wait time associated with queued workloads in Athena. It is for this reason that running your own Presto service will always allow you to have higher performance with more flexibility than AWS Athena. 

This also has price-performance implications: taking into account the wait time associated with queued workloads, you’ll see a much better price-performance ratio from your own Presto service than from AWS Athena.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Deterministic Performance

How do I get deterministic performance out of Amazon Athena?

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. This approach is advantageous for interactive querying on datasets already residing in S3, without the need to move the data into another analytics database or a cloud data warehouse. To learn how to get deterministic performance out of Athena, continue reading.

What is Athena great for?

Users love Athena for its simplicity, the ease of getting started, and the fact that no servers need to be managed. You only pay for the amount of data scanned. Athena is closely integrated with AWS Glue, so if you are already using the Glue catalog for your ETL workloads, all your tables are already defined and accessible. You can then make these tables available for interactive analytics to your extended team.


What are the trade-offs?

The simplicity of the deployment architecture of Amazon Athena, however, comes at a price of inconsistent performance with scale, as many users of Athena have already experienced.

There are two primary trade-offs with Athena. Firstly, transparency – “you get what you get” – you have no visibility to the underlying infrastructure serving your queries nor do you have the ability to control or tweak that infrastructure for your workloads. The approach works for cases where the performance/latency is not critical, but it can be a non-starter for users who need control. Secondly, despite all its advantages, shared services serverless models have disadvantages related to performance predictability.

AWS has documented best practices in their performance tuning guide to get the most out of Athena and to avoid typical errors that users encounter, such as “Query exceeded local memory limit” and “Query exhausted resources at this scale factor”. These include using optimized formats such as Parquet and ORC, avoiding small files, and partitioning the data appropriately.

How does the serverless architecture impact my query performance?

Athena at its core is a shared serverless service per region. To protect against usage spiraling out of control due to a handful of customers, Amazon has placed restrictions on the usage, size, concurrency, and scale of the service, both on a per-customer basis and within the region overall.

These limits and restrictions are well documented in the service limits, such as 20 active DDL and DML queries each in most regions, 30-minute max query timeouts, API limits, and throttling, among others (some of these limits can be raised by working with Amazon support). These limits are guard rails around the service so that the usage of one customer doesn’t adversely affect the experience of another. These guardrails are, however, far from perfect, since there are only a finite number of resources per region to be shared across customers in the region. Any excessive load due to seasonal or unexpected spikes at Amazon’s scale will easily consume the shared resources, causing contention and queuing.

In addition to the shared query infrastructure, Athena’s federated query connectors are based on Lambda, which is again serverless. Lambda scales out well and can be performant once warmed up; however, consistent performance comes only with consistent use. Depending on the usage of a particular connector in the region and the available capacity of the backend infrastructure, you could run into latencies caused by cold starts, especially if you are using connectors that are not accessed frequently, e.g. custom connectors.

If a large number of users end up using Athena at the same time, especially for large-scale queries, they often observe extended queuing of their queries. Though the eventual query execution might not take time once resources are available, the queueing significantly impacts the end-user experience for interactive query workloads. Users have also at times reported inconsistency of execution times of the same query from one run to another which ties back into the shortcomings of the shared resources model. 

So can I get deterministic performance out of Athena? 

Yes, you can get deterministic performance from Athena. If your queries are not complex or latency-sensitive and your infrastructure is in a less crowded region, you may not encounter performance predictability issues frequently. However, your mileage depends on several factors: when you run the query, which region you run it in, the volume of data you access, and your account service-limit configurations, to name just a few.

What are my options to get deterministic performance?

To get deterministic performance from AWS Athena, consider your usage. If your interactive query workload is latency-sensitive and you want to control the performance your end users experience, you need dedicated managed infrastructure. A managed Presto service gives you the best of both worlds: it abstracts the complexity of managing a distributed query service while still giving you the knobs to tune the service to your workload needs.

Ahana provides a managed Presto service that can scale up and down depending on your performance needs. You can segregate workloads into different clusters or choose to share the cluster. You can also choose beefier infrastructure for more business and time-critical workloads and also set up separate clusters for less critical needs. You make that choice of price, deterministic performance, and flexibility depending upon business objectives.

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine for data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences in this article.

Do I have to use AWS Lambda to connect to data sources with Athena?

The Athena Federated Query Journey

AWS announced the public preview of Athena federated query in November 2019 and moved the service to General Availability (GA) in November 2020. As of this writing, AWS supports two versions of the Athena engine: engine 1 based on Presto 0.172, and engine 2 based on Presto 0.217. Among other features, the federated query functionality is only supported on engine 2 of Athena.

What connectors are supported?

Athena currently supports a number of data source connectors:

  • Amazon Athena CloudWatch Connector
  • Amazon Athena CloudWatch Metrics Connector
  • Athena AWS CMDB Connector
  • Amazon Athena DocumentDB Connector
  • Amazon Athena DynamoDB Connector
  • Amazon Athena Elasticsearch Connector
  • Amazon Athena HBase Connector
  • Amazon Athena Connector for JDBC-Compliant Data Sources (PostgreSQL, MySQL, and Amazon Redshift)
  • Amazon Athena Neptune Connector
  • Amazon Athena Redis Connector
  • Amazon Athena Timestream Connector
  • Amazon Athena TPC Benchmark DS (TPC-DS) Connector

In line with the serverless model of AWS Athena, and to maintain an impedance match between Athena and the connectors, the connectors are also serverless, based on AWS Lambda, and run within the region. These connectors are open-sourced under the Apache 2.0 license and are available on GitHub, or they can be accessed from the AWS Serverless Application Repository. Third-party connectors are also available in the Serverless Application Repository.

Athena is based on Presto. Can I use the open-source Presto data source connectors with Athena?

Presto has an impressive set of connectors right out of the box; these connectors, however, cannot be used as-is with Athena. The Presto service provider interface (SPI) required by the Presto connectors is different from AWS Athena’s Lambda-based implementation, which is based on the Athena Query Federation SDK.

You can use the Athena Query Federation SDK to write your own connector using Lambda or to customize one of the prebuilt connectors that Amazon Athena provides and maintains. The Athena connectors use the Apache Arrow format for returning data requested in a query. An example Athena connector can be found here.

If you’re looking for a managed service approach for PrestoDB, Ahana Cloud offers that based on open source Presto. Ahana’s managed deployment is 100% compatible with the open source Presto connectors, and you can get started with a click of a button.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

How do I do geospatial queries and spatial joins in Presto?

A question that often comes up is “how do I do geospatial queries and spatial joins in Presto?”. Fortunately Presto supports a wealth of functions and geospatial-specific joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set up some test data in two tables. The first table is trips_table, which will store 4 rows, each representing a simple taxi trip. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (
trip_id int, 
orig_long double, 
orig_lat double, 
dest_long double, 
dest_lat double, 
trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast('2021-03-02 09:00:00 UTC' as timestamp));
insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast('2021-03-02 09:30:00 UTC' as timestamp));
insert into trips_table values (3, 48.858965, 2.293497, 48.859952, 2.340328, cast('2021-03-02 09:45:00 UTC' as timestamp));
insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast('2021-03-02 10:45:00 UTC' as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table, with each row storing the shape representing an area (e.g. central London) and a name for the shape. We represent the shape with a sequence of coordinates that enclose a specific area:

create table city_table (
geo_shape varchar, 
name varchar);

insert into city_table values ('POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))', 'London, central');
insert into city_table values('POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))', 'Los Angeles, downtown');

For information:

  • I used a simple triangle to represent downtown LA (see illustration below)
  • I used a rectangle representing central London. 
  • In each case the first pair of coordinates for the shape is the same as the last pair, so it’s an enclosed bounding box or polygon we’re describing.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coordinates using POLYGON(()) well-known text (WKT) notation. The double brackets are required.
[Figure: Simple polygon to represent downtown LA (credit: “Geospatial Data Analysis using SQL”)]

Now let’s run a query to count how many trips occurred in each city. We join our two tables, and we use each journey’s originating long/lat coordinates to determine – using ST_CONTAINS() – if that point exists in any of our shapes.  This function requires the polygon to be expressed as a special type – Geometry – so we convert our shape from text using ST_GeometryFromText() function:

SELECT c.name as City, count(*) as Trips 
FROM trips_table as t 
JOIN city_table as c 
ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat)) 
GROUP BY 1;

         City          | Trips 
-----------------------+-------
 Los Angeles, downtown |     1 
 London, central       |     2 
(2 rows)

We see both London trips made it into the result set, despite one of the trips ending at the airport, which is well outside the shape we defined for central London – this is because the query uses the originating coordinates for each trip, not the destination coordinates.

Also notice the Paris trip didn’t make it into the result – this is because we did not define a shape for Paris.
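
To make the mechanics concrete, here is a small Python sketch (not Presto) of the point-in-polygon test that ST_Contains performs under the hood, using the classic ray-casting algorithm against the same central-London rectangle from city_table. The two sample points are assumptions chosen for illustration.

```python
def point_in_polygon(point, polygon):
    """Return True if point (x, y) falls inside the closed polygon,
    given as a list of (x, y) vertices whose first and last entries match."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:]):
        # Does a ray cast from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# The central-London bounding box from city_table, as (lat, long) pairs;
# the first pair is repeated at the end so the polygon is closed
central_london = [
    (51.519287, -0.172316), (51.519287, -0.084103),
    (51.496393, -0.084103), (51.496393, -0.172316),
    (51.519287, -0.172316),
]

piccadilly_circus = (51.5101, -0.1340)   # inside the box
heathrow_t5 = (51.4722, -0.4890)         # well outside the box

print(point_in_polygon(piccadilly_circus, central_london))  # True
print(point_in_polygon(heathrow_t5, central_london))        # False
```

This is the same logic the query above delegates to ST_Contains: only trips whose originating coordinates fall inside a stored shape are counted.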

In this example you’ve seen some of the benefits of using Ahana Cloud for Presto

Presto’s Geospatial functions are listed in the Presto documentation.  

We have a ton of resources to help you get started with Presto, check them out here.

How do I query JSON documents with Presto?

JSON documents are a common data type. A lot of people collect logs and load them into S3. Querying JSON with Presto can be challenging because you may not always have a consistent schema. However, the great part about Presto is that it has functionality that enables you to get insights on your JSON data.

Presto has a JSON data type. The JSON data type is similar to a string. One way to think about it is as a string containing JSON, except that the data has already been parsed and is guaranteed to be well-formed JSON.

But it is also slightly different semantically, because JSON has maps. In a raw string, the keys of a map can appear in any order, but in the JSON data type the keys of a map are always stored in a consistent order. As a result, two JSON values containing the same map entries compare as equal even if their keys originally appeared in different orders. This matters when doing an aggregation, grouping, or comparison: logically, two such maps are treated as the same value, no matter what the original key order was.

How the JSON parse function works

The JSON parse function takes a JSON string and returns a JSON data type. Here is an example of the JSON data type:

SELECT json_parse('null'); -- JSON 'null'
SELECT json_parse('true'); -- JSON 'true'
SELECT json_parse('42'); -- JSON '42'
SELECT json_parse('"abc"'); -- JSON '"abc"'
SELECT json_parse('[1, 2, 3]'); -- JSON '[1,2,3]'
SELECT json_parse('{"a": 1, "b": 2}'); -- JSON '{"a":1,"b":2}'
 
SELECT json_parse('["hey", 42, {"xyz": 100, "abc": false}]');

After a string is parsed with json_parse, the keys in each map are stored in a consistent order. This is why two differently ordered maps with the same contents end up as the same JSON value and compare as equal.
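
The same idea can be illustrated outside Presto with a couple of lines of Python: once the raw text is parsed, maps with the same entries compare equal regardless of the key order in the original string.

```python
import json

# Two JSON maps with identical entries but different key order in the raw text
a = json.loads('{"xyz": 100, "abc": false}')
b = json.loads('{"abc": false, "xyz": 100}')

# The raw strings differ, but the parsed maps compare equal by content
print(a == b)  # True
```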

In Presto, all the JSON functions take the JSON data type. And if you pass a string into the JSON function, it will be implicitly parsed into JSON.

Another question that comes up: what is the difference between a JSON null and a SQL NULL?

This can be confusing because JSON has a null and so does SQL, but these are two different things. A SQL NULL is a special value that works much as it does in other programming languages: operations on a NULL produce a NULL. A JSON null, on the other hand, is just another value, like a number, a string, or an array. So they are totally different, and this has implications when writing Boolean expressions.
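
A quick way to see the difference in Presto itself (a sketch you can run in any Presto session):

```sql
-- SQL NULL propagates: this comparison yields NULL, not true or false
SELECT NULL = NULL;

-- A JSON null is an ordinary value: these two JSON values compare as equal
SELECT json_parse('null') = json_parse('null');
```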

We have a lot more resources on Presto if you’d like to learn more, check out our Getting Started with Presto page.

Is there latency overhead for Presto queries if everything fits into memory and doesn’t need to be distributed?

Presto is both in-memory and distributed, so each worker has memory and uses it. However, Presto is not an in-memory database. Presto is the query engine and reads from storage underneath. This is where Presto’s caching capabilities become important.

If all the data required to satisfy a given query arrives from the connected data sources and fits within the memory of the Presto cluster, everything should work accordingly. Beyond that, Presto’s MPP, in-memory pipelining architecture will accelerate workloads further in two ways:

1) Presto will ensure data is divided across all the workers to bring maximum processing power and memory capacity to bear.

2) Presto’s generated query plan will ensure execution steps are distributed amongst all the workers such that processing takes place in parallel on the data across all workers simultaneously, as efficiently as possible. Furthermore, with its caching capabilities, Presto can accelerate query execution even further for specific workload patterns, further reducing latency.

If you want to get started with Presto, check out our docs to learn more about the Ahana Cloud managed service for Presto.

Is the Hive metastore a hard dependency of Presto, or could Presto be configured to use something else like Postgres?

With Presto, there’s no hard dependency of having to use the Hive metastore – it’s catalog-agnostic. However, there are significant advantages like better performance when you use the Hive metastore.

A managed service like Ahana Cloud for Presto provides a managed version of a Hive metastore catalog, so you don’t have to deal with the complexity of managing your own metastore. Ahana also supports AWS Glue as a data catalog service.

You don’t necessarily have to use the metastore if you are connecting via the provided connectors such as MySQL, Postgres, Redshift, Elasticsearch, etc. But if you intend to query data on S3, for instance, you will need a metastore (i.e. Hive or Glue) to define external tables and their formats.

Presto has many connectors including Postgres, which is a supported data source.

The Differences Between Apache Drill vs Presto

Drill is an open source SQL query engine that traces its roots to the 2010 Google paper “Dremel: Interactive Analysis of Web-Scale Datasets”. Development of Apache Drill began in 2012.

Performance & Concurrency
Drill has never enjoyed wide adoption partially because it was tied to one Hadoop distribution (MapR) but mainly because of inherent performance and concurrency limitations. There are companies that built products based on Drill who report these performance and concurrency issues, and many have migrated away from Drill as a result. Presto’s popularity and adoption, on the other hand, has exploded as numerous companies from SME to web-scale have deployed Presto and contribute to the project. These include Facebook, Uber, Alibaba, Twitter, Netflix, AirBnB and LinkedIn. Users value Presto’s vendor-neutrality and rate of innovation.

Metastores
Presto connects to external metastores (AWS Glue, Hive Metastore Catalog); many users deploy Presto + AWS Glue/Hive for their data lake analytics. In addition, schema for relational data sources can be obtained directly from each connector. On the other hand, Drill performs its own schema discovery and does not need a metastore such as Hive (but can use Hive if needed), which is a benefit.

Community
Compared to Presto, the Apache Drill community has dwindled somewhat especially when it comes to adoption and contributions. We talk to many users who are looking to move from Apache Drill to Presto. If we look at DB engines – https://db-engines.com/en/ranking/relational+dbms – we see that Presto has continued on a positive upward trend and ranks #27, as opposed to Drill which is at #49 in a downward trend.

Overall, Drill is in decline. This was accelerated when HPE (Hewlett Packard Enterprise, who acquired MapR) announced they will no longer support or contribute to Drill (a drawback of having a query engine tied to an ecosystem such as Hadoop). Presto’s popularity continues to increase, as illustrated by 1) the number of commits to the project (In the past month 46 authors have pushed 79 commits to master and 88 commits to all branches, and on master 568 files have changed and there have been 10,930 additions – as of Feb 26 2021), and 2) Presto’s continued rise on DB-engines.

Drill is a top-level Apache Foundation project but does not have a strong and active community behind it. Presto is backed by a strong community and is overseen by the Presto Foundation which is part of the Linux Foundation with 8 premier organizations driving it including Facebook, Uber, Twitter, Intel and Ahana.

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Why am I getting zero records when I use AWS Athena to query a CSV file?

There’s a common error many AWS Athena users see when they query CSV files – they will get zero records back. This tends to happen when they run the AWS Glue crawler and create a new table on the CSV file. There are several reasons why the query might return no records.

  1. The crawler is pointing to a file instead of an AWS S3 bucket
  2. The LOCATION path is wrong or there’s a double slash in it
  3. The partitions haven’t loaded into the Glue Data Catalog or the internal Athena data catalog
  4. In general, CSV crawlers can be sensitive to different issues (i.e. embedded newlines, partially quoted files, blanks in integer fields), so make sure everything is accurate

On the first point, if you have selected a file instead of your S3 bucket, the crawler will succeed but you won’t be able to query the contents which is why you’ll see the ‘Zero Records Returned’ error message. If there are other files that you don’t want crawled, you’ll need to create a new folder and move your CSV to that new folder, and then update the include path accordingly (and you’ll need to re-crawl it).

In order to get your S3 data lake to work, you’ll need to make sure each batch of same-schema files has its own top level directory.
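
If the culprit is the third item above (partitions not yet loaded into the catalog), loading them explicitly often resolves the empty result. A sketch, with a hypothetical table name and S3 layout:

```sql
-- Discover and load all partitions under the table's LOCATION
MSCK REPAIR TABLE my_csv_table;

-- Or register one partition explicitly
ALTER TABLE my_csv_table ADD PARTITION (dt = '2021-01-01')
  LOCATION 's3://my-bucket/data/dt=2021-01-01/';
```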

If you’re running into issues like this with AWS Athena and want to offload the management while still getting the power of Presto, check out Ahana Cloud. It’s a fully managed service for AWS that removes these types of complexities and makes it really easy to run Presto in the cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Does Presto work natively with GraphQL?

Some users may have a primary data store that is GraphQL-based (AWS AppSync) and want to leverage Presto. For context, GraphQL falls in the application as part of the service mesh infrastructure. It is an abstraction API for mostly operational sources.

Typically for analytics, users move their data into an analytical stack, because the properties and queries of an operational stack and analytical stack are quite different.

Users in this kind of environment typically move the data to S3 to run data lake analytics. Presto is the de facto engine for data lake analytics, but in addition it also allows running queries across S3 and operational systems like RDS.

Running analytics on GraphQL / AppSync is not recommended because:

  1. GraphQL isn’t really a query engine with an optimizer, etc.
  2. Analytical queries are fairly intensive from a compute perspective and need the right analytical engine to process and execute them. GraphQL is mostly a pass-through, and this will affect your operational systems.

Want to learn more about Presto? Download our free whitepaper: What is Presto?

Why does a single AWS Athena query get stuck in QUEUED state before being executed?

One of the drawbacks of AWS Athena is the fact that as a user, you don’t have control over query performance and predictability. One specific issue that comes up quite a bit for Athena users is single queries getting stuck in a QUEUED state before being executed. This happens because of Athena’s serverless nature in combination with its multi-tenant service – you are essentially competing for resources when you run a query.

Many users will get the message “Query waiting in queue” somewhat randomly. For this reason, it’s recommended to not use Athena for user-facing requests because of its unpredictability, and generally many users have decided they can’t rely on Athena at all based on this issue. If you can’t predict demand and manage multi-tenancy, it’s not ideal for ad-hoc analytics (which is a core use case for PrestoDB).

For users who still want to leverage PrestoDB for ad-hoc analytics and data lake analytics, Ahana Cloud is a solution that allows you to run Presto transparently as a managed service. You won’t run into the issues around unpredictability and queries getting stuck. There’s a 14 day free trial if you’re interested in checking it out.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

How Presto Joins Data

Presto is a distributed system composed of a coordinator and workers; each worker can connect to one or more data sources through corresponding connectors.

The coordinator receives the query from the client and optimises and plans the query execution, breaking it down into constituent parts, to produce the most efficient execution steps. The execution steps are sent to the workers which then use the connectors to submit tasks to the data sources. The tasks could be file reads, or SQL statements, and are optimised for the data source and the way in which the source organises its data, taking into account partitioning and indexing for example.

The data sources supported by Presto are numerous and can be an RDBMS, a noSQL DB, or Parquet/ORC files in an object store like S3 for example. The data sources execute the low level queries by scanning, performing filtering, partition pruning etc. and return the results back to the Presto workers. The Presto join operation (and other processing) is performed by the workers on the received data, consolidated, and the joined result set is returned back to the coordinator.
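
As a hedged sketch of what this looks like in practice (the catalog, schema, and table names here are hypothetical), a single query can join Parquet files on S3, read through the Hive connector, with a live table in MySQL:

```sql
SELECT c.name, sum(o.total) AS revenue
FROM hive.sales.orders AS o        -- Parquet files on S3, via the Hive connector
JOIN mysql.crm.customers AS c      -- live table in MySQL, via the MySQL connector
  ON o.customer_id = c.id
GROUP BY c.name;
```

Each connector handles the scan and any pushed-down filtering at its source; the workers perform the join itself.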

You will notice Presto uses a “push model” which is different, for example, to Hive’s “pull model”. Presto pushes execution steps to the data sources, so some processing happens at the source, and some happens in Presto’s workers. The workers also communicate between each other, and the processing takes place in memory which makes it very efficient, suitable for interactive queries.  Hive on the other hand will read/pull a block of a data file, execute tasks, then wait for the next block, using the map reduce framework. Hive’s approach is not suitable for interactive queries since it is reading raw data from disk and storing intermediate data to disk, all using the framework MapReduce, which is better suited to long-running batch processing. This diagram compares Hive and Presto’s execution approaches:

[Diagram: Hive’s pull-model execution compared with Presto’s push-model execution]

How Presto Joins Relational and Non-Relational Sources

The next diagram shows some of Presto’s core coordinator components and the kinds of tasks Presto’s workers handle. In this simplistic example there are two data sources being accessed: one worker is scanning a Hive data source, the other is scanning a MongoDB data source. Remember Presto does not use Hive’s MapReduce query engine or HQL – the diagram’s “hive” worker means it is using the Hive connector, the file system information comes from the metastore, and the raw source data is external to Presto, perhaps in HDFS in Parquet or ORC format. The worker dealing with MongoDB data is described as being on “the probe side” (in this example), whereby the Mongo data is read, processed, normalized into columnar stores in Presto, and then shuffled (or exchanged) across the cluster to the “build side” worker (the worker dealing with the Hive data here) for the actual Presto join to take place.

[Diagram: Presto coordinator components, with workers scanning Hive and MongoDB sources and exchanging data for the join]

This is a simplistic example; in reality Presto is more sophisticated – the join operation could be running in parallel across multiple workers, with a final stage running on one node (since it cannot be parallelized). This final stage is represented by the third worker at the top of the diagram, labeled “Output”. This output node’s task is to stream the result set back to the coordinator, and from there back to the client.

To see an example of a complex join in Presto, check out our previous article on CROSS JOIN UNNEST.
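
For a quick flavor of that pattern (table and column names hypothetical), CROSS JOIN UNNEST expands an array column into one row per element:

```sql
-- orders(id bigint, items array(varchar))
SELECT o.id, t.item
FROM orders AS o
CROSS JOIN UNNEST(o.items) AS t (item);
```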

Joins – Summary

It’s easy to see how Presto is the “Polyglot Data Access Layer” since it doesn’t matter where your data lives: any query can access any data, in-place, without ETL or data shipping or duplication. It is true federation. Even when blending very different sources of data, like JSON data in Elasticsearch or MongoDB with tables in a MySQL RDBMS, Presto takes care of the flattening and processing to provide a complete, unified view of your data corpus.

If you want to try out Presto, take a look at Ahana Cloud. It provides a managed service for Presto in AWS.

Using Spark’s Execution Engine With Presto

Learn when to use Spark as an additional engine alongside open-source Presto, and how you can configure and invoke a Spark job from Presto.

Executing Presto Spark queries is possible, but why leverage Spark as an execution framework for Presto’s queries when Presto is itself an efficient execution engine?  The fact that both are in-memory execution engines tends to add further confusion.

Presto and Spark are different kinds of engines – Presto specializes in analytical query execution; Spark’s emphasis is on calculation. The following should help clarify why Presto users would or would not want to use Spark as an additional engine. 

A Quick Presto/Spark Comparison 

Executing Presto queries on Spark can be useful for some very specific workloads, such as queries we want to run on thousands of nodes, queries requiring tens or hundreds of terabytes of memory, queries that may consume many CPU-years, or cases where we need extra execution resilience. So if you have a particularly ugly, complex, and very long-running query, this may be an option.

Spark provides several features that can be desirable for certain workloads, like:

  • Resource isolation
  • Fine-grained resource management
  • Spark’s scalable materialized exchange mechanism 
  • In terms of memory usage, both are in-memory execution engines, but with one important difference: Spark will write/spill data to disk when memory is exhausted.
  • Spark has data processing fault tolerance, so if a worker node fails and its data partition goes away, that partition will be re-computed.

There are also some downsides, or at least considerations, when using Spark with Presto: 

  • Spark commits tasks and applies for resources dynamically as it needs them, so at each stage tasks will grab their required resources, a strategy that can slow down processing. Presto, on the other hand, mostly does this upfront.
  • The way in which data is processed by the two engines differs: Spark needs data to be fully processed before progressing to the next processing stage. Presto uses a pipeline processing approach where data is sent to the next task as soon as possible which can reduce total execution time.
  • Spark SQL supports a subset of the ANSI SQL standard. Many features are missing, and for this reason many developers avoid Spark SQL. 

When To Use Spark’s Execution Engine With Presto

It is recommended that you:

  • Do not use Spark for interactive queries. This is not what Spark was designed for, so performance will be poor.
  • Do use Spark for long-running, SQL-based batch ETL queries, provided Spark SQL supports the features and functions you need. You can run ETL/batch-type queries in Presto, but you may choose Spark as the execution engine when a developer wants to test a query in ad hoc mode using Presto first, then convert it to Spark to run as a batch pipeline in production.

Another scenario where Presto and Spark can be used together: if there is no Spark connector available for the data source you want to access, use Presto and its connectors to access the data, then reach Presto from Spark via JDBC.

Setup  

Setting up Presto and Spark and submitting jobs is well documented in PrestoDB’s docs at https://prestodb.io/docs/current/installation/spark.html and is straightforward to set up. You will need Java and Scala installed, plus a working Presto cluster and a Spark cluster. Then there are two packages to download and a file to configure. You then execute your queries from the Spark cluster by running spark-submit.

Tip: If you are using a Mac, standalone Spark can be installed very simply with brew install apache-spark. Similarly, Presto can be installed with brew install prestodb.

Verify your Spark cluster by running spark-shell on the command line.

Configuration

Presto’s config.properties file needs to have its task.concurrency, task.max-worker-threads and task.writer-count parameters set to values appropriate for your hardware. Ensure these values also match the parameters used with the spark-submit command.

Invocation 

Use the spark-submit command to invoke Spark, passing the sql query file as an argument. See https://prestodb.io/docs/current/installation/spark.html for an example spark-submit command. 

The above article should help you understand the how, why and when of executing Presto queries on Spark.

When I run a query with AWS Athena, I get the error message ‘query exhausted resources on this scale factor’. Why?

AWS Athena is well documented as having performance issues, both in terms of unpredictability and speed. Many users have pointed out that even relatively lightweight queries on Athena will fail. Part of the issue may be how many columns the user has in the GROUP BY clause – even a small number of columns (fewer than five) can run into this issue of not having enough resources to complete. Other times it may be due to how much data is being parsed; again, even small amounts of data (less than 200MB) can hit the same error.

Presto stores GROUP BY columns in memory while it works to match rows with the same grouping key. The more columns in the GROUP BY clause, the fewer rows get collapsed by the aggregation. To address this problem, reduce the number of columns in the GROUP BY clause and retry the query.
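
As a sketch (table and column names hypothetical), trimming the grouping key reduces the amount of per-group state held in memory:

```sql
-- A wide grouping key keeps many distinct groups in memory at once
SELECT region, city, device, browser, os, count(*) AS hits
FROM events
GROUP BY 1, 2, 3, 4, 5;

-- Grouping on only the columns the report needs collapses far more rows
SELECT region, device, count(*) AS hits
FROM events
GROUP BY 1, 2;
```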

And at still other times, the issue may not be how long the query takes but whether the query runs at all. Users who experience “internal errors” on queries one hour can re-run the same queries later and have them succeed.

Ultimately, AWS Athena is not predictable when it comes to query performance. That’s where Ahana Cloud, a managed service for Presto, can help. Ahana’s managed service for PrestoDB can help with some of the trade offs associated with a serverless service.

Some of the reasons you might want to try a managed service if you’re running into performance issues with AWS Athena:

  • You get full control of your deployment, including the number of PrestoDB nodes and the node instance types for optimum price/performance.
  • Consistent performance because you have full control of the deployment
  • Ahana is cloud-native and runs on Amazon Elastic Kubernetes Service (EKS), helping you reduce operational costs with automated cluster management, increased resilience, speed, and ease of use.
  • No limits on queries

Plus you can use your existing metastore, so you don’t need to modify your existing architecture.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Creating tables in a S3 Bucket gives “Query failed External location must be a directory”

So here’s why you are here. You are using Presto’s Hive connector and you want to create a new table using an S3 folder that already exists. You have full read/write access to the folder, you can see the folder, but Presto tells you “External location must be a directory”.

Whiskey Tango Foxtrot – The directory isn’t a directory? 

In this example, you have already created your target directory myDir using the AWS S3 console, in a bucket called myBucket,  but attempts to create the customer table in that directory fail:

presto> CREATE TABLE hive.customer ( customer_id bigint, email varchar(64) ) 
WITH ( external_location = 's3a://myBucket/myDir/' , format = 'PARQUET' );

Query 20210104_110546_00015_mhsym failed: External location must be a directory

The issue here is not immediately obvious, read on further to learn more.

Why is the issue occurring?

S3 is an immutable object store and not a traditional file system. Objects are stored as key-value pairs internally: the key is the “path” and the value is the actual object. There is no such thing as a folder in the traditional filesystem sense of the word. All there is are objects and associated metadata. If the path doesn’t have the appropriate metadata associated with it (in this case content-type: ‘application/x-directory’), Presto will complain that the path is not a directory.

You have 2 primary options of addressing the issue:

  1. Manually associate the metadata via the AWS Console UI
  2. Copy the object over again using S3 command line or python script

Note: Since S3 is an immutable store, there is no way to simply “update the metadata”. Under the covers, if you use the AWS Console, the UI performs a clone of the object and updates the metadata. If you need to do it via the API, you have to perform a copyObject and update the metadata.

See snippet from the AWS docs:  https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html

“You can set object metadata in Amazon S3 at the time you upload the object. Object metadata is a set of name-value pairs. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata.”

So how do I fix the issue?

Option 1: Change metadata using the AWS Console UI

Choose “Edit metadata” for the path in question

EditMetadata
Choose Edit metadata for the Path

Add ‘Content-Type’ as ‘application/x-directory’ as a system defined metadata

Add Content Type
Set Content-Type to ‘application/x-directory’

Option 2: Change metadata using tools (copy objects over again)

AWS Cli

aws s3 cp --content-type 'application/x-directory' s3://<bucket>/<object> s3://<bucket>/<object> --metadata-directive REPLACE

S3 API:

#Put with metadata
$ aws s3api put-object --bucket mybucket --key folder_name_xxx/ --content-type application/x-directory
{
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\""
}
 
#Verify the metadata
 
$ aws s3api head-object --bucket mybucket --key folder_name_xxx/
{
    "AcceptRanges": "bytes",
    "LastModified": "Fri, 23 Apr 2021 16:03:21 GMT",
    "ContentLength": 0,
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
    "ContentType": "application/x-directory",
    "Metadata": {}
}

Python Script

import boto3

s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key')
# Content-Type is a system header, not user-defined metadata, so set it via
# the ContentType parameter on the copy rather than the Metadata dict
s3_object.copy_from(
    CopySource={'Bucket': 'bucket-name', 'Key': 'key'},
    ContentType='application/x-directory',
    Metadata=s3_object.metadata,
    MetadataDirective='REPLACE')

Alternatively:

s3 = boto3.client('s3')
s3.put_object(Bucket=BUCKET, Key=KEY, ContentType='application/x-directory')

The above-mentioned steps should help you resolve the issue and your create table should work fine without complaining about the ‘External location must be a directory’ error.

--Now CREATE TABLE works fine
presto> CREATE TABLE hive.customer ( customer_id bigint, email varchar(64) )
 WITH ( external_location = 's3a://myBucket/myDir/' , format = 'PARQUET' );
CREATE TABLE

Big Data Query

When it comes to querying big data using standard SQL, you have come to the right place, as this is what Presto was designed to do. Presto, the leading open source distributed query engine, gives users the ability to interactively query any data and any data source at any scale. And it is proven – Presto has been in large-scale production deployment at internet-scale companies like Facebook, Uber, and Twitter for several years, and the project enjoys continued innovation as a result.

Ahana’s mission is to simplify ad hoc analytics for organizations of all shapes and sizes by making open source Presto a fully integrated, cloud-native, managed service for AWS – the easiest Presto experience there is. 

Utilizing Big Data

The role of a data analyst, and of other members within a data team, is to discover patterns, trends, and correlations in large datasets. Large-scale data analytics is the umbrella term for activities ranging from generating reports and creating dashboards to running complex queries. A big data query is a request for data from a database table or from a combination of many different tables or files, which may be in different file formats, in relational databases, and in object storage like S3.

What is Presto


In the past, data analysts and data teams (usually comprised of some mix of data engineers, data architects, and data scientists) typically queried data stored in traditional relational databases and data warehouses. These days, the amount of data being generated is too big to be stored solely in on-prem relational databases; as a result, big data is now increasingly stored in the cloud and/or in distributed clusters. To query these huge and dispersed datasets, you need a distributed query engine like Presto.

The good news is, if you work with terabytes or even petabytes of data, Presto is an ideal high-performance, distributed SQL query engine for querying expansive and diverse datasets. It is open source and specifically designed to run big data queries against data of any size, on a wide variety of sources. It makes use of multiple connectors which allow you to access these data sources and query them in place; there is no need to move the data.

Presto for Big Data

One of the main benefits of Presto is that it separates data storage from processing, so analysts can query big data where it is stored without having to move all of the data into a separate analytics system. With just a single Presto query, you can query big data from multiple sources with fast response times ranging from sub-seconds to minutes. Another benefit is that big data analysts can use standard SQL to query the data, which means they don’t have to learn any new complex languages.

This is why Presto is becoming the de facto standard for Big Data Querying.

Related Articles

A Comprehensive Guide to Data Warehouse Types

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are, what they are not, and the benefits they provide to data analytics team members within organizations.

Presto vs Snowflake: Data Warehousing Comparisons

Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and an analytics service. Learn more about the differences between Presto and Snowflake in this article.

I use RDS Postgres databases and need some complex queries done which tend to slow down my databases for everyone else on the system. What do I need to consider as I add a data lake for the analytics?

Background

Many medium-sized companies start out using one of the six flavors of Amazon RDS: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. As they grow they can have numerous instances of RDS with numerous databases for each customer, whether internal or external. When one customer tries a large analytic query, that workload can cause problems with the RDS cluster, perhaps causing it to drop other workloads, fail, or slow down for everyone. As the need to process huge amounts of data increases, so does the need to take your analytics to the next level.

In addition to your operational databases, the idea is to have a much more open analytics stack where you have the ability to run different kinds of processing on the same data. A modern analytics stack gives your organization far more insight without impacting your operational side. And doing it with open data formats is another key consideration.

Considerations

There are a couple of options for evolving your analytics stack. One would be to use a cloud data warehouse like Amazon Redshift or Snowflake. Another would be to use open formats in a data lake with a modern SQL query engine. In the first case, you get close to the highest performance possible on your data, but it comes at a certain cost as well as an amount of lock-in, as you cannot easily get at data stored in proprietary formats. Considering the data lake with query engine option, we believe that Presto is one of the best choices because of its performance, scalability, and flexibility to connect to S3-based data lakes and federate other data sources as well.

So our recommendation would be to run Presto on top of data stored in an open Parquet or ORC format in S3. Doing it this way, you can put other engines on top of the data as needed, so you’re not going to face a lot of rework in the future should you decide to change something.

From OLTP to Data Lake Analytics

The high-level concept is to have an initial one-time bulk migration of your data in OLTP databases to get a copy moved into S3. After that, as your operational databases continue to generate or change data, you’ll need to establish a pipeline, a stream, or a Change Data Capture (CDC) process to get those changes into S3. Note that you will rarely want data flowing back from S3 into your relational databases.

While there are different ways to pipe data into S3, one AWS-recommended approach is to use the AWS Database Migration Service, a.k.a. DMS (much in the same way you may have used it when you migrated off-prem). With AWS Database Migration Service, you can do the first one-time bulk load and then continuously replicate your data with high availability, streaming it to Amazon S3. AWS DMS runs in the background and handles all of the data changes for you. You can pick the instance and the period of time you’d like it to run on; for example, you may want hourly partitions or daily partitions. That will depend on how fast your data is changing and what your requirements are for analytics as well.

Next you’ll want to install Presto on top, and for that you can build a Presto cluster yourself, or simply use Ahana Cloud for Presto to create pre-configured clusters in about 30 minutes.

It’s also worth noting that after you’ve ingested the data into S3 based on what you think the most optimized format or folder structure would be, you may find out that you need it different. In that case, not to worry: you can use Presto itself to do data lake transformations as well. Using SQL you can do a CTAS, Create Table As Select:

The CREATE TABLE AS SELECT (CTAS) statement is one of the most important SQL features available. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest and fastest way to create and insert data into a table with a single command.
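As a sketch (the catalog, schema, and table names here are illustrative), a CTAS that rewrites raw CSV-backed data into Parquet might look like:

presto> CREATE TABLE ahana_hive.demo.orders_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM ahana_hive.demo.orders_csv;

The new table picks up the format given in the WITH clause, so the same pattern works for converting to ORC or repartitioning data into a different folder layout.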

Summary

By adding a modern analytics stack to your operational databases, you evolve your analytics capabilities and deliver more insights for better decisions. We suggest moving to an open data lake analytics reference architecture with Presto. This allows a meta-analysis of all your data, giving a look at broader trends across databases and other data sources.

Complex SQL Queries

Complex SQL queries benefit from Presto’s distributed, parallel, in-memory processing architecture and cost-based optimizer. And with Presto’s federation capabilities even more complex queries can be unleashed on multiple data sources in a single query. 

To ensure optimum performance with complex queries, Presto has features like dynamic filtering, which can significantly improve the performance of queries with selective joins by avoiding reading data that would be filtered out by the join condition. And the collection of table statistics using ANALYZE tablename is highly recommended.
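For example, assuming a Hive-backed table named orders in a demo schema (the names are illustrative), statistics can be collected and then inspected like this:

presto> ANALYZE hive.demo.orders;
presto> SHOW STATS FOR hive.demo.orders;

The collected row counts and column statistics are what the cost-based optimizer uses to pick join orders and strategies.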

Complex SQL features and functions are used when advanced calculations are needed, when many tables (perhaps from multiple sources) are to be joined, when dealing with nested or repeated data, time-series data, or complex data types like maps, arrays, structs and JSON, or perhaps a combination of all these things.

Presto’s ANSI SQL engine supports numerous advanced functions which can be split into several categories; links to the PrestoDB documentation are provided for convenience.

Advanced SQL Queries with Presto

Advanced SQL features and functions are used by analysts when, for example, complex calculations are needed, when many tables (perhaps from multiple sources) need to be joined, when dealing with nested or repeated data, time-series data, or complex data types like maps, arrays, structs and JSON, or perhaps a combination of all these things.

Presto’s ANSI SQL engine supports numerous advanced functions which can be split into several categories; links to the PrestoDB documentation are provided for convenience.

Running advanced SQL queries can benefit greatly from Presto’s distributed, in-memory processing architecture and cost-based optimizer. 

How To Stop Presto

If you are using the presto-admin tool this is how to stop Presto safely:

$ presto-admin server stop

In addition these commands are also useful:

presto-admin server start
presto-admin server restart
presto-admin server status

If you are using Presto on EMR, you can stop and restart Presto with:

sudo stop presto
sudo start presto

(If you have changed any configuration params you should do this on every node where you made a change).

The above “How To Stop Presto” information is correct for PrestoDB. But other “hard forks” of Presto may use different methods to stop and start. 

Presto New Releases

Where is the latest release of PrestoDB? And where can I find the release notes? Where is the documentation? These are common questions with easy answers. Presto’s web site https://prestodb.io/ and GitHub have all the information you need on Presto new releases:

Releases: https://github.com/prestodb for the Presto and Presto-admin repositories.

Release Notes: https://prestodb.io/docs/current/release.html 

Documentation: https://prestodb.io/docs/current/ 

The above list has all the main resources you need for working with Presto’s new releases. If you’re looking to get up and running quickly with Presto, Ahana Cloud is a SaaS platform for Presto. Presto can be complicated, with over 200 parameters to configure and tune – Ahana Cloud takes care of all those; all you do is point and click to get your Presto cluster up and querying your data sources. You can sign up for a free trial and check it out at https://ahana.io/sign-up

How Much Memory To Give A Presto Worker Node

Presto is an in-memory query engine and so naturally memory configuration and management is important. A common question that comes up is how much memory should I give a worker node?

JVM Memory

Presto’s JVM memory config nearly always needs to be changed – you shouldn’t be running Presto with its default setting of 16GB of memory per worker/coordinator, since that caps the maximum memory Presto can use on each node.

The Xmx flag specifies the maximum memory allocation pool for a Java Virtual Machine. Change the -Xmx16G in the jvm.config file to a number based on your cluster’s capacity and number of nodes. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html on how to do this.

Rule of Thumb

It is recommended you set aside 15-20% of total physical memory for the OS. So for example if you are using EC2 “r5.xlarge” instances which have 32GB of memory, 32GB - 20% = 25.6GB, so you would use -Xmx25G in the jvm.config file for coordinator and worker (or -Xmx27G if you want to go with 15% for the OS).

This is assuming there are no other services running on the server/instance, so maximum memory can be given to Presto. 
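Putting that together, a worker’s jvm.config for the r5.xlarge example above might look like the following sketch (the -Xmx value follows the 20% rule of thumb; the other flags are typical of the Presto deployment docs and may differ in your installation):

-server
-Xmx25G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError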

Presto Memory 

As with the JVM settings above, there are two memory-related settings that you should check before starting Presto. For most workloads, Presto’s other memory settings will work perfectly well when left at their defaults. There are, however, configurable parameters controlling memory allocation that can be useful for specific workloads. The practical guidelines below will 1) help you decide if you need to change your Presto memory configuration, and 2) identify which parameters to change.

Workload Considerations

You may want to change Presto’s memory configuration to optimise ETL workloads versus analytical workloads, or for high query concurrency versus single-query scenarios.  There’s a great in-depth blog on Presto’s memory management written by one of Presto’s contributors at  https://prestodb.io/blog/2019/08/19/memory-tracking which will guide you in making more detailed tweaks. 

Configuration Files & Parameters 

When first deploying Presto there are two memory settings that need checking. Locate the config.properties files for both the coordinator and worker. There are two important parameters here: query.max-memory-per-node and query.max-memory. Again, see https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for rules of thumb and how to configure these parameters based on the available memory.
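As an illustrative sketch only (the right values depend on your instance sizes, cluster size, and workload, so treat these numbers as placeholders), a config.properties might set:

query.max-memory=50GB
query.max-memory-per-node=12GB

query.max-memory-per-node bounds how much user memory a single query may consume on one worker, while query.max-memory bounds its total across the whole cluster.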

The above guidelines and links should help you when considering how much memory to give a worker node. You can see there’s a lot of tuning and configs to manage – over 200. To avoid all memory configuration work you can use Ahana Cloud for Presto – a fully managed service for Presto that needs zero configuration. Sign up for a free trial at https://ahana.io/sign-up.

How to Show Tables From All Schemas with Presto

In Presto it is straightforward to show all tables in a schema. For example, if we have a MySQL data source/catalog that has a “demo” schema, we can use show tables in mysql.demo; but this only reveals the tables managed by that data source.

There is no equivalent way to show all tables in all schemas for a data source. However, there is the metastore to fall back on, which we can query: in MySQL, Glue, Hive and others there is a schema called “information_schema” which contains a table called “tables”. This maintains a list of the schemas and tables belonging to that data source.

For a MySQL data source, here’s what Presto’s information_schema shows:

presto> select table_schema, table_name from mysql.information_schema.tables order by 1,2;
    table_schema    |                  table_name                   
--------------------+-----------------------------------------------
 demo               | geography                                     
 demo               | state                                         
 demo               | test                                          
 information_schema | applicable_roles                              
 information_schema | columns                                       
 information_schema | enabled_roles                                 
 information_schema | roles                                         
 information_schema | schemata                                      
 information_schema | table_privileges                              
 information_schema | tables                                        
 information_schema | views                                         
 sys                | host_summary                                  
 sys                | host_summary_by_file_io                       
 sys                | host_summary_by_file_io_type                  
...

For Ahana’s integrated Hive metastore:

presto:demo> select table_schema, table_name from ahana_hive.information_schema.tables order by 1,2;
    table_schema    |        table_name         
--------------------+---------------------------
 csv_test           | yellow_taxi_trips         
 csv_test           | yellow_taxi_trips_orc     
 csv_test           | yellow_taxi_trips_staging 
 information_schema | applicable_roles          
 information_schema | columns
...    

This should help you show tables from all schemas.

How To Convert Bigint to Timestamp with Presto

UNIX timestamps are normally stored as doubles. If you have UNIX timestamps stored as big integers then you may encounter errors when trying to cast them as timestamps:

presto> select col1 from table_a;
     col1      
------------
 1606485526 
 1606485575 
 
presto> select cast(col1 as timestamp) from table_a;
Query 20201127_150824_00052_xnxra failed: line 1:8: Cannot cast bigint to timestamp

There is a solution!  Presto’s from_unixtime() function takes a UNIX timestamp and returns a timestamp:

presto> select col1,from_unixtime(col1) as ts from table_a;
    col1    |          ts          
------------+-------------------------
 1606485526 | 2020-11-27 13:58:46.000 
 1606485575 | 2020-11-27 13:59:35.000 

And we can optionally modify the format of the result by using date_format():

presto> select date_format(from_unixtime(col1),'%Y-%m-%d %h:%i%p') from table_a;
        _col0        
---------------------
 2020-11-27 01:58PM 
 2020-11-27 01:59PM 

That’s how to use from_unixtime() to convert bigint to timestamp. 

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Presto Queries

How do I convert timestamp to date with Presto?

Luckily Presto has a wide range of conversion functions and they are listed in the docs.  Many of these allow us to specifically convert a timestamp type to a date type.

To test this out we can use Presto’s built-in current_timestamp function (an alias for the now() function) that returns the current system time as a timestamp:

presto> select current_timestamp as "Date & Time Here Now";
         Date & Time Here Now          
---------------------------------------
 2020-11-27 13:20:04.093 Europe/London 
(1 row)

To grab the date part of a timestamp we can simply cast to a date:

presto> select cast(current_timestamp as date) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

Or we can use date() which is an alias for cast(x as date):

presto> select date(current_timestamp) as "Today's date";
 Today's date 
--------------
 2020-11-27   
(1 row)

We can use date_format() which is one of Presto’s MySQL-compatible functions: 

presto:demo> select date_format(current_timestamp, '%Y%m%d') as "Today's date";
 Today's date  
----------
 20201127 
(1 row)

Finally we can use format_datetime() which uses a format string compatible with JodaTime’s DateTimeFormat pattern format:

presto:demo> select format_datetime(current_timestamp, 'Y-M-d') as "Date";
  Date  
----------
 2020-11-27 
(1 row)

The above examples should allow you to convert timestamps to dates in just about any scenario.

Looking to learn more about Presto?

Download the free Ebook, Learning and Operating Presto

Common Questions/Errors:

Why am I getting the following error: ‘>’ cannot be applied to date, varchar (10)

Answer: You are trying to compare a string literal in your query to a date type in your schema. The operator in the error can be any comparison operator: >, =, <, >=, <=. To fix it, you need to cast the string to a date type or use a function to parse the string to a date.

Example error: select now() as time_now where current_date = '2021-06-01'

Example solution:

Option 1: Cast 
select now() as time_now where current_date > date '2021-06-01'

Option 2: Cast Explicit
select now() as time_now where current_date > Cast ( '2021-06-01' as Date)

Option 3: Parse iso8601 date
select now() as time_now where current_date > from_iso8601_date('2021-06-01')

Option 4: MySQL parser, Non Standard Date format using date_parse 
select now() as time_now where current_date > date_parse('01-06-2021','%d-%m-%Y')

Option 5: Java DateTime format parser, Non Standard Date format using parse_datetime 
select now() as time_now where current_date > parse_datetime('01-06-2021','dd-MM-YYYY')

How do I convert string with timestamp to a date?

Answer: You are trying to accurately convert a string of the format such as ‘2021-05-13T19:02:57.415006+01:00’ to date and use it in your query for comparison. You need to parse the string using either date_parse or parse_datetime

Example problem: Need to convert the following timestamp to date: ‘2021-05-13T19:02:57.415006+01:00’

Example solution:

Option 1: MySQL parser, Non Standard Date format using date_parse 
select date_parse('2021-05-13T19:02:57.415006','%Y-%m-%dT%H:%i:%s.%f') AT TIME ZONE '+01:00'

Option 2: Java DateTime format parser, Non Standard Date format using parse_datetime 
select parse_datetime('2021-05-13T19:02:57.415006+01:00','YYYY-MM-dd''T''HH:mm:ss.SSSSSSZ') 

Looking for related content? How To Convert Date String to Presto Date Format


Case Sensitive Search Configuration with Presto

When dealing with character data, case sensitivity can be important when searching for specific matches or patterns. But not all databases and query engines behave in the same way. Some are case insensitive by default, some are not. How do we configure things so they behave in the way we want?

Here’s an example of why we might need to take steps for case sensitive search configuration. We’re accessing a MySQL database directly:

mysql> select * from state where name='iowa';
+------+----+--------------+
| name | id | abbreviation |
+------+----+--------------+
| Iowa | 19 | IA           |
+------+----+--------------+
1 row in set (0.00 sec)

MySQL is case-insensitive by default. Even though the MySQL column contains the capitalized string ‘Iowa’ it still matched the query’s restriction of ‘iowa’.  This may be acceptable, but in some use cases it could lead to unexpected results.

Using Presto to access the same MySQL data source things behave differently, and arguably, in a more expected way:

presto:demo> select * from state where name='iowa';
 name | id | abbreviation 
------+----+--------------
(0 rows)
 
Query 20201120_151345_00001_wjx6r, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:07 [1 rows, 0B] [0 rows/s, 0B/s]
 
presto:demo> select * from state where name='Iowa';
 name | id | abbreviation 
------+----+--------------
 Iowa | 19 | IA           
(1 row)

Now we only get a match with ‘Iowa’, and not with ‘iowa’. Presto has essentially made this data source (MySQL) case sensitive, even though it is exactly the same database in both the above examples, with default configurations used.

Reconfigure Case Sensitivity 

With a RDBMS like MySQL you can configure the collation setting to control if you want case sensitivity or not.  You can set the collation at the database creation or table creation level as a part of the CREATE statement. Or you can use ALTER to change the collation of a database, table or individual column. This is described in MySQL’s documentation. 

But how do you change Presto to be case-insensitive? Presto does not support collation, because it is not a database and doesn’t store data. And there is no configuration parameter that controls this.  To manage case sensitivity in Presto, and mimic collation, we rewrite the query to force case insensitivity explicitly by using:

  • simple lower() or upper() functions

Example:

select * from state where lower(name)='california';
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

This query has matched any upper/lower case combination in the table, mimicking case insensitivity.

Or we can use regular expressions.

Example:

select * from state where regexp_like(name, '(?i)california');
    name      
------------
 california 
 California 
 CALIFORNIA 
(3 rows)

The regular expression syntax (?i) means matches are case insensitive. 

When it comes to Case Sensitive Search Configuration you are now an eXpErT.

When should I use ORC versus Parquet when using Presto?

If you’re working with open data lakes using open source and open formats, you may have multiple file formats in play. Presto works with both ORC and Parquet, so you’ll probably want to optimize for your workloads.

Both ORC and Parquet store data in columns. Parquet tends to be the most efficient when it comes to storage and performance. ORC, on the other hand, is ideal for storing compact data and skipping over irrelevant data without complex or manually maintained indices. For example, ORC is typically better suited for dimension tables, which are slightly smaller, while Parquet works better for the fact tables, which are much bigger.

If you’re looking to get up and running quickly with Presto, you can check out Ahana Cloud. It’s a SaaS for Presto and takes care of all the configuration, tuning, deployment, etc.

What’s the advantage of having your own Hive metastore with Presto? How does it compare to Amazon Glue?

First let’s define what Apache Hive is versus Amazon Glue. Apache Hive reads, writes, and manages large datasets using SQL. Hive was built for Hadoop. AWS Glue is a fully managed ETL service for preparing and loading data for analytics. It automates ETL and handles the schemas and transformations. AWS Glue is serverless, so there’s no infrastructure needed to provision or manage it (you only pay for the resources used while your jobs are running).

Presto isn’t a database and does not come with a catalog, so you’d want to use a metastore like Hive’s to read/write/manage your datasets. Presto abstracts the catalog, such as Hive, underneath it. You can use the Glue catalog as the default Hive metastore for Presto.
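For example, pointing a Presto Hive catalog at the Glue Data Catalog is done in the catalog properties file. The sketch below is based on the Hive connector documentation; the file name (etc/catalog/hive.properties) and AWS region are assumptions for illustration:

connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=us-east-1

With hive.metastore=glue, Presto reads table and schema definitions from Glue instead of a standalone Hive metastore service.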

With Ahana Cloud, you don’t really need to worry about integrating Hive and/or AWS Glue with Presto. Presto clusters created with Ahana come with a managed Hive metastore and a pre-integrated Amazon S3 data lake bucket. Ahana takes care of connecting external catalogs like Hive and Amazon Glue, so you can focus more on analytics and less on integrating your catalogs manually. You can also create managed tables as opposed to external tables.

How to Find Out Data Type of Value with Presto

Presto has a typeof() function to make finding out data types of values easy. This is particularly useful when you are getting values from nested maps for example and the data types need to be determined.

Here’s a simple example showing the type returned by Presto’s now() function:

presto:default> select now() as System_Timestamp, typeof( now() ) as "the datatype is";
           System_Timestamp            |     the datatype is      
---------------------------------------+--------------------------
 2020-11-18 15:15:09.872 Europe/London | timestamp with time zone 

Some more examples:

presto:default> select typeof( 'abc' ) as "the datatype is";
 the datatype is 
-----------------
 varchar(3)      
 
presto:default> select typeof( 42 ) as "the datatype is";
 the datatype is 
-----------------
 integer         
 
presto:default> select typeof( 9999999999 ) as "the datatype is";
 the datatype is 
-----------------
 bigint          
 
presto:default> select typeof( 3.14159 ) as "the datatype is";
 the datatype is 
-----------------
 decimal(6,5)

Armed with this info you should now be able to find out the data types of values.

How to Rotate Rows to Columns with Presto

Sometimes called pivoting, here is one example of how to rotate rows into columns with Presto.  

Suppose we have rows of data like this:

'a', 9
'b', 8
'a', 7 

We want to pivot this data so that all the ‘a’ row values are arranged in one column, and all the ‘b’ row values are in a second column like this:

 a | b 
---+---
 9 |   
   | 8 
 7 |   

To rotate from rows to columns we will add an id to make aggregation easy. We will name the output columns a and b after the key values, and we’ll include the id in our result set. This is how we do the rotation in Presto, using VALUES() to supply the test data, and simple conditional CASE WHEN END logic:

presto:default> SELECT id
, MAX(CASE WHEN key = 'a' THEN value END) AS a
, MAX(CASE WHEN key = 'b' THEN value END) AS b 
FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7 )) as test_data (id, key, value) 
GROUP BY id ORDER BY id;
 
 id |  a   |  b   
----+------+------
  1 |    9 | NULL 
  2 | NULL |    8 
  3 |    7 | NULL 
(3 rows)

There are other SQL options for transforming (pivoting) rows into columns – you can use the map_agg function, for example.
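As a sketch of the map_agg alternative (using the same inline test data), we aggregate the key/value pairs into a map per id and read the columns out with element_at, which returns NULL for keys that are missing from a row’s map:

presto:default> SELECT id, element_at(kv, 'a') AS a, element_at(kv, 'b') AS b
FROM (
    SELECT id, map_agg(key, value) AS kv
    FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7)) AS test_data (id, key, value)
    GROUP BY id
)
ORDER BY id;

This avoids writing one CASE expression per output column, which helps when there are many distinct keys.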

The code sample and description here should help when you need to rotate data from rows to columns using Presto.


How do you rotate columns to rows with Presto?

Sometimes called unpivoting, here is one example of how to rotate column data with Presto.  

Suppose we have some integer data arranged in two columns called a and b:

 a | b 
---+---
 9 |   
   | 8 
 7 |   

We want to rotate the columns into rows like this, where for every ‘a’ column value we now see a row labeled ‘a’, and ditto for the b values:

'a', 9
'b', 8
'a', 7

To rotate from columns to rows in Presto we’ll use a CTE and VALUES() to supply the test data, and simple conditional CASE WHEN END logic coupled with a sub-select and a UNION:

presto:default> with testdata(value_a, value_b) as ( VALUES (9,null), (null,8), (7,null) ) 
select key, value from 
(select 'a' as key, value_a as value 
from testdata 
UNION select 'b' as key, value_b as value 
from testdata) 
where value is not null;
 
 key | value 
-----+-------
 a   |     9 
 b   |     8 
 a   |     7 
(3 rows)

There are other SQL options for rotating (unpivoting) columns into rows: the CROSS JOIN UNNEST construct, for example, is similar to Hive’s LATERAL VIEW explode function.
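As a sketch of the UNNEST approach using the same test data, we can zip a constant array of key names with an array of the column values, avoiding the UNION:

presto:default> with testdata(value_a, value_b) as ( VALUES (9,null), (null,8), (7,null) )
SELECT t.key, t.value
FROM testdata
CROSS JOIN UNNEST(ARRAY['a','b'], ARRAY[value_a, value_b]) AS t (key, value)
WHERE t.value IS NOT NULL;

When UNNEST is given multiple arrays it expands them element-wise in parallel, so each row of testdata produces one ('a', value_a) row and one ('b', value_b) row before the NULL filter.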

The code sample and description here should help when you need to rotate data from columns to rows using Presto.

How to Lateral View Explode in Presto

Hive’s explode() function takes an array (or a map) as input and outputs the elements of the array (map) as separate rows. Explode is a built-in Table-Generating Function (UDTF) in Hive and can be used in a SELECT expression list and as a part of LATERAL VIEW.

The explode function doesn’t exist in Presto; instead we can use Presto’s similar UNNEST. 

Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

This code sample and description should help when you need to do a lateral view explode in Presto.


How do you CROSS JOIN UNNEST a JSON array in Presto?

Let’s take a look at how you can cross join unnest a JSON array by using Presto joins. Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as 
(
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN
    UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
-------------+---------------
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

The UNNEST approach is similar to Hive’s explode function.

This code sample and description should help when you need to execute a cross join to unnest a JSON array. 

>> Looking for an easy way to get up and running with Presto? Take a look at Ahana Cloud. <<


How can you write the output of queries to S3 easily?

With Ahana Cloud, we’ve made it easy for you to write the output of queries to S3. While there’s a variety of formats, here’s an example:

presto> CREATE SCHEMA ahana_hive.s3_write WITH (location = 's3a://parquet-test-bucket/');
CREATE SCHEMA

presto> CREATE TABLE ahana_hive.s3_write.my_table
WITH (format = 'PARQUET')
AS SELECT <your query here> ;

Does Amazon Athena do joins across other data sources besides S3? Does Amazon Athena connect to other data sources?

With Amazon Athena you’re limited in scope when it comes to doing joins across other data sources like relational database systems. You have to set up a Lambda function, which then connects to your database on the back end – an additional piece you have to manage on your own.


With Ahana Cloud, you can directly connect to many different types of data sources including S3, MySQL, PostgreSQL, Redshift, Elastic, and more with a few clicks, and you’re ready to integrate with any cluster.

If I have catalogs connected and configurations attached to my Presto cluster, what happens when I take the cluster down?

If you’re managing Presto on your own, either through your own installation or through a service like AWS EMR or AWS Athena, you have to maintain and manage all of the catalogs and configurations attached to your cluster. That means that if you take your cluster down, you’ll lose those catalogs and configurations – they are not maintained once the cluster is gone.


You can use the Ahana Cloud managed service for Presto to help with this. Ahana Cloud manages all of that for you, so you don’t have to worry about losing catalogs and configurations attached to your Presto cluster. You also get Presto bundled with components like the Hive metastore, Apache Superset, and more.

Check If Map Or Presto Array Is Empty Or Contains

When working with array (indexable lists) or map (key-value tuple) complex types in Presto, it is useful to be able to check if a map or array is empty. The examples below should enable you to do just that.

Maps

First, let’s create a Presto map from an array of key values:

presto> select map(array['taco','donut'], array[1,2]) as myMap;
       myMap       
-------------------
 {donut=2, taco=1} 

We can use the CARDINALITY() function to check if the map contains anything:

presto> select cardinality(map(array['taco','donut'], array[1,2])) = 0 as is_empty;
 is_empty 
----------
 false    

presto> select cardinality(map(array[], array[])) = 0 as is_empty;
 is_empty 
----------
 true     

We can also compare a map with an empty map() to test if it is empty:

presto> select (map(array['taco'], array['donut']) = map()) as is_empty;
 is_empty 
----------
 false 

Arrays

Again, we can use the CARDINALITY() function to check if an array contains anything. Here’s an example using an array with 3 elements:

presto> SELECT cardinality(ARRAY['Ahana', 'Cloud', 'Presto']) = 0 as is_empty;
is_empty 
--------
 false  


presto> SELECT cardinality(ARRAY[]) = 0 as is_empty;
is_empty 
--------
 true  

Another method to check if an array is empty is to compare it with array[] like this:

presto> SELECT (map_keys(map(ARRAY['Ahana'],ARRAY['Presto']))= array[]) as is_empty;
 is_empty 
----------
 false 

Tip: Be aware of null values. If an array contains nulls it is not considered to be empty:

presto> SELECT cardinality(ARRAY['', '', '']) = 0 as is_empty;
is_empty 
--------
 false  

presto> SELECT cardinality(ARRAY[null,null,null]) = 0 as is_empty;
is_empty 
--------
 false  
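
Note also that the cardinality of a NULL array is NULL, not zero, so cardinality(col) = 0 will not match rows where the array column itself is NULL. One way around this is to wrap the check in COALESCE – a minimal sketch, where mytable and tags are hypothetical names:

```sql
-- Treat a NULL array column the same as an empty one
SELECT coalesce(cardinality(tags), 0) = 0 AS is_empty
FROM mytable;
```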

The above query examples should enable you to check if a map or array is empty.

If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

Export Result Of Select Statement in Presto

A common question is “how can I run a query and export the result of a select statement quickly and easily in Presto?” You are in luck, as several solutions exist – your choice will depend on your preferred tool, the output format you want, and the size of the result set. Here are two options.

Using Presto-CLI

If you are using the Presto command line tool presto-cli (or just presto on the Mac if you used brew install presto) then use the --output-format parameter and redirect the output to a regular file. For example:

$ presto --server https://myPrestoCluster.com \
--catalog glue --schema amazon \
--execute "select product_title, count(*) as Num_Reviews from reviews where lower(product_title) like '%tacos%' group by 1 order by 2 desc limit 20" \
--output-format CSV > top20_taco_reviews.csv

$ head top20_taco_reviews.csv
"Dragons Love Tacos","832"
"Vitacost Extra Virgin Certified Organic Coconut Oil 16 Oz","281"
"Vitacost 100% Pure Peppermint Oil -- 4 fl oz","178"
"Vitacost 100% Pure Lavender Oil -- 4 fl oz","168"
"Taco Tender Holder - Plastic Red Stand - Holds 3 Tacos","106"
"Vitacost Infant Health - Baby-D's Liquid Vitamin D Drops -- 400 IU - 1 fl oz","101"
"Vitacost Butterbur Extract - Standardized -- 75 mg - 120 Capsules","84"
"Tacos, Tortas, and Tamales: Flavors from the Griddles, Pots, and Streetside Kitchens of Mexico","63"
"Vitacost Pine Bark Extract - Standardized to 95% OPC -- 100 mg - 300 Capsules","51"
"Vegan Tacos: Authentic and Inspired Recipes for Mexico's Favorite Street Food","45"

There are several formats supported by Presto-CLI, the default being quoted CSV:

--output-format <output-format>
            Output format for batch mode [ALIGNED, VERTICAL, CSV, TSV,
            CSV_HEADER, TSV_HEADER, NULL] (default: CSV)

So if you want to see column headers in your CSV format output file use --output-format CSV_HEADER

The advantage of using this approach is speed.

Using a Database Tool

If you are using a third-party SQL database tool like DbVisualizer, DBeaver or SQuirreL SQL then the UI will allow you to save the query output. For example, using DbVisualizer, run your query and click the Export button.

The advantage of this method is the huge number of output formatting options on offer. The disadvantage is it is usually slower than using Presto-CLI. 

The two options above should help you export results of a Select statement.

Extract Keys in a Nested JSON Array Object With Presto

Let’s say we have some JSON that looks like this:

[{"id": 1, "value":"xxx"}, {"id": 2, "value":"yyy"}]

In this instance our JSON contains key-value pairs. How can we decode it using PrestoDB SQL to extract only the keys (1 and 2 in this example)?

We can extract keys in a nested JSON array object using SQL arrays, the JSON_PARSE function and the TRANSFORM function.  The TRANSFORM function’s syntax is:

transform(array(T), function(T, U)) -> array(U)

Returns an array that is the result of applying function to each element of array.

For this example query we’ll also use the VALUES function to provide the test data – VALUES is a handy way to define an inline table, kind of like a dummy table.

select 
TRANSFORM(CAST(JSON_PARSE(array) AS ARRAY<MAP<VARCHAR, VARCHAR>>), entry -> entry['id']) as My_Keys 
from (values ('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]')) t(array);

 My_Keys 
--------
 [1, 2]  

This works by first creating an inline table whose column, called ‘array’, contains the input JSON data. We then use JSON_PARSE, which converts a string containing JSON text into deserialized JSON values, and we cast those values as an array of maps with varchar keys and values. TRANSFORM then returns an array containing only the elements we want (we call each map element ‘entry’ in this example, but it could be called anything).
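
The same pattern works if you want the values instead of the keys – just index each map entry by the other field. For example, using the same inline data as above:

```sql
SELECT
  TRANSFORM(CAST(JSON_PARSE(array) AS ARRAY<MAP<VARCHAR, VARCHAR>>),
            entry -> entry['value']) AS My_Values
FROM (values ('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]')) t(array);
-- My_Values: [xxx, yy]
```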

So with Presto’s rich set of array and JSON functions we can see how to easily extract keys in a nested JSON array object. If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.


Generate Interval from Variable Using PrestoDB

There are numerous ways to generate an interval from a variable using PrestoDB SQL, for different data types.


Dates

Date intervals can be generated for a given date range using a combination of the following not-so-well-known SQL functions:

  • SEQUENCE() – generates a sequence of values between a given start and stop range. It generates an ARRAY type. Function syntax is:

sequence(start, stop, step) -> array(timestamp)

Generate a sequence of timestamps from start to stop, incrementing by step. The type of step can be INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. Using step we can choose the interval we need, e.g. INTERVAL '1' DAY.

Note that there is an array size upper limit of 10000.

  • FROM_ISO8601_DATE() – Parses an ISO 8601 formatted string into a date.
  • VALUES() – defines an inline / dummy table. You can use VALUES() anywhere a query can be used, e.g. as the FROM clause of a SELECT or in an INSERT.
  • UNNEST() – this function is used to expand an ARRAY into a single column.

Example 1

Here’s an example query showing how to generate dates for the first 2 weeks of June with an interval of 1 day. We create an array with the sequence of dates we need called date_array, then convert the array into a series of values with a join.

SELECT CAST(date_column AS DATE) date_column 
FROM (
  VALUES (SEQUENCE(FROM_ISO8601_DATE('2020-06-01'),    
                   FROM_ISO8601_DATE('2020-06-14'), INTERVAL '1' DAY) ) ) AS t1(date_array) 
CROSS JOIN UNNEST(date_array) AS t2(date_column) ;

 date_column 
-------------
 2020-06-01  
 2020-06-02  
 2020-06-03  
 2020-06-04  
 2020-06-05  
 2020-06-06  
 2020-06-07  
 2020-06-08  
 2020-06-09  
 2020-06-10  
 2020-06-11  
 2020-06-12  
 2020-06-13  
 2020-06-14  
(14 rows)

Example 2

Here’s a slightly different approach this time using the DATE_ADD function. We specify a start date in a table and the result set begins with this given date plus the next n (10 in this example) dates. We’re still using an array but the array from the SEQUENCE function is fed straight to the UNNEST function.

create table table1(start_date date);

insert into table1 values date('2020-10-13');

select DATE_ADD('day',s.n,t.start_date) as date from table1 t cross join UNNEST(SEQUENCE(0,10)) s (n);
    date    
------------
 2020-10-13 
 2020-10-14 
 2020-10-15 
 2020-10-16 
 2020-10-17 
 2020-10-18 
 2020-10-19 
 2020-10-20 
 2020-10-21 
 2020-10-22 
 2020-10-23 
(11 rows)

Integers

Example

Here’s a simple example generating a sequence of integers from 20 to 0 in steps of -2:

 SELECT x  FROM UNNEST(sequence(20, 0, -2)) t(x);
 x  
----
 20 
 18 
 16 
 14 
 12 
 10 
  8 
  6 
  4 
  2 
  0 
(11 rows)

These examples should help you when trying to generate an interval from a variable.

What is an Open Data Lake in the Cloud?

The Open Data Lake in the cloud is the solution to the massive data problem. Many companies are adopting that architecture because of better price-performance, scale, and non-proprietary architecture.

Data Warehouse Concepts for Beginners

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Check out this article for more information about data warehouses.

How to Presto Escape a Single Quote

In reality there will be some occasions when you need to use a quote character in your data, in your query, or in result sets, but you want the SQL interpreter to treat the quote as “just another character”, as opposed to a quote with special meaning – namely to denote a string literal in most SQL dialects. This is called escaping a character.

Some SQL dialects use a backslash as the escape character. Presto’s SQL needs no escape character; instead it uses a double character notation. Here’s a simple example of what I mean:

presto:default> select 'hello world';
 _col0 
-------
 hello world

What if I want single quotes to actually appear in the result?  I need to do this in Presto:

presto:default> select '''hello world''' as col1;
  col1   
---------
 'hello world' 

I have simply used two single quotes to tell the interpreter the places where “I really want to include a single quote”. 

Another example: What if there are single quotes in my table? 

presto:default> select * from mytable where column_a in ('Driver's License');

Query n failed: line 1:43: mismatched input 'in'. Expecting: 'AND', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LIMIT', 'OR', 'ORDER', 'UNION', <EOF>

The query was rejected because the interpreter encountered an unexpected single quote in “Driver’s License”.  To handle this we use two single quotes:

presto:default> select * from mytable where column_a in ('Driver''s License');

(query runs ok)
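
If doubling up quotes gets hard to read, another option is to build the quote from its code point using Presto’s chr() function (chr(39) is a single quote) – a small sketch:

```sql
SELECT 'Driver' || chr(39) || 's License' AS col1;
-- col1: Driver's License
```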

So that’s how to escape a single quote in Presto.

Check if null and return default (NVL analog)

How to Check if null

SQL has a number of functions for dealing with null values, and everyone has their favorite. The complicating factor is that not all SQL engines support all the functions, and some SQL dialects have their own versions of the functions which may be completely different (remember the “S” in SQL stands for Structured, not Standard!)

Let’s look at the functions used to substitute alternative, or default, values for NULLs. These functions are really useful if you need to see zeros instead of nulls in your results, or if you need to avoid division by zero errors.  We’ll check if null and return default (nvl analog).

Oracle has the NVL(column_x, value_y) function which will return column_x UNLESS it is a null, in which case the value_y will be returned instead. 

In Microsoft SQL Server, the function is ISNULL(column, value)

In MySQL, you use IFNULL(column, value)

In most databases (MySQL included), you can use COALESCE(value1, value2, …) which returns the first non-null value in the list.

And in some databases, you will find the function NULLIF(value1, value2) which returns null if value1 equals value2, otherwise returns value1.
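
As a quick illustration of why these functions matter, NULLIF is a tidy way to avoid division-by-zero errors – a sketch using an inline VALUES table as dummy data:

```sql
-- Returns NULL instead of failing when the denominator is 0
SELECT total / NULLIF(cnt, 0) AS avg_val
FROM (VALUES (100, 4), (50, 0)) t(total, cnt);
-- avg_val: 25, then NULL
```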

And in most databases, you can use good old CASE, for example:

SELECT a,
       CASE
           WHEN a = 999 THEN 'this is too much'
           WHEN a is null THEN 'this is null'
           ELSE 'we are good'
       END
FROM …

Check if null | PrestoDB

What about PrestoDB’s support when it comes to handling nulls? PrestoDB supports the following:

  • COALESCE(value1, value2, …) – returns the first non-null value (or column) in the list.
  • NULLIF(value1, value2) – returns null if value1 equals value2, otherwise returns value1.
  • CASE..WHEN..END – various forms supported; see https://prestodb.io/docs/current/functions/conditional.html for more information. Use with IS NULL to match null values.
  • IF(condition, true_value, false_value) – evaluates and returns true_value if condition is true, otherwise false_value. Use with IS NULL to match null values.

You’ll notice NVL() is not supported by PrestoDB. Use COALESCE(column_x, value_y) instead as a direct replacement, or use IF() or a CASE statement.

Some examples using PrestoDB:

--Check we have a null in our data
presto:default> select partkey, brand, container from parts_null_test where partkey=200001;
 partkey |   brand    | container 
---------+------------+-----------
  200001 | SuperBrand | NULL      


--Turn nulls into the string ‘Single’ using coalesce
presto:default> select partkey, brand, COALESCE(container,'SINGLE') as CONTAINER from parts_null_test where partkey=200001;
 partkey |   brand    | CONTAINER 
---------+------------+-----------
  200001 | SuperBrand | SINGLE    

--Use the IF() method to turn nulls into the string ‘Single’
presto:default> select partkey, brand, IF(container is null,'SINGLE', container) as CONTAINER from parts_null_test LIMIT 5;
 partkey |   brand    | CONTAINER  
---------+------------+------------
  200001 | SuperBrand | SINGLE     
  100001 | Brand#35   | JUMBO CASE 
  100002 | Brand#34   | WRAP CAN   
  100003 | Brand#21   | SM BOX     
  100004 | Brand#42   | SM CASE    
 
--Use a CASE statement
presto:default> select partkey, brand, CASE WHEN container is null THEN 'SINGLE' ELSE container END as CONTAINER from parts_null_test LIMIT 5;
 partkey |   brand    | CONTAINER  
---------+------------+------------
    7024 | Brand#52   | JUMBO PACK 
    7025 | Brand#15   | LG PACK    
    7026 | Brand#55   | LG BOX     
    7023 | Brand#22   | LG DRUM    
  200001 | SuperBrand | SINGLE     

The above information should help you check if null and return a default (NVL analog) using COALESCE(), IF(), or CASE in your PrestoDB SQL queries.


Static Date and Timestamp in Where Clause

Static Date & Timestamp

In this post we’ll look at using a static date and timestamp in a where clause with Presto. Many databases automatically convert between CHAR or VARCHAR and other types like DATE and TIMESTAMP as a convenience feature. Constants in a query are also often auto-converted. Take these example queries, which count rows where the table’s DATE column is equal to a specific date, or within a range of time:

Select count(*) from myTable where myDateCol = '2020-09-28';

Select count(*) from myTable where myTimeCol between '2020-09-28 14:00:00.000' and '2020-09-28 14:59:59.000';

The underlying table has columns defined as DATE and TIMESTAMP types.  But the query is actually providing date and timestamp literals in the form of strings!  Some databases will tolerate this, but Presto is stricter in this respect and will give you an error like this:

presto:sf1 > select count(*) from orders where orderdate < '2020-09-01';
Query failed: line 1:45: '<' cannot be applied to date, varchar(10)

With Presto you must either CAST your values to the correct type or, slightly simpler, use the date or timestamp type constructors:

$ presto --schema sf1 --catalog tpch

presto:sf1> select count(*) from orders where orderdate < date '2020-09-01';
  _col0  
---------
 1500000 
(1 row)

> select count(*) from transactions where myTimestampCol between timestamp '2020-09-01 22:00:00.000' and timestamp '2020-09-01 22:59:59.000';
  _col0  
---------
   42 
(1 row)

--Using CAST() also works:
presto:sf1> select count(*) from orders where orderdate < CAST('2020-09-01' as DATE);
  _col0  
---------
 1500000 
(1 row)
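
Once you are using proper date literals, interval arithmetic also works as expected – for example, a rolling 30-day window, sketched against the same orders table:

```sql
SELECT count(*) FROM orders
WHERE orderdate >= date '2020-09-01' - interval '30' day;
```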

In summary, by using CAST() or the DATE/TIMESTAMP type constructors you will be able to specify a static date and timestamp in where clause.



Where can I find the Presto Server Bootstrap logs?

Bootstrap logs

A common question is “where can I find the Presto server bootstrap logs?”  These are logs that indicate what is happening to the cluster on startup. The Presto server generates them, and the content looks like this:

2020-05-22T11:59:32.045-0400 INFO main io.airlift.log.Logging Logging to stderr
2020-05-22T11:59:32.049-0400 INFO main Bootstrap Loading configuration
2020-05-22T11:59:32.224-0400 INFO main Bootstrap Initializing logging
2020-05-22T11:59:33.268-0400 INFO main Bootstrap PROPERTY DEFAULT RUNTIME DESCRIPTION
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.uri null http://localhost:8080 Discovery service base URI
2020-05-22T11:59:33.268-0400 INFO main Bootstrap service-inventory.uri null null Service inventory base URI
2020-05-22T11:59:33.268-0400 INFO main Bootstrap service-inventory.update-interval 10.00s 10.00s Service inventory update interval
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.presto.pool general general
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.collector.pool general general
2020-05-22T11:59:33.270-0400 INFO main Bootstrap discovery.max-age 30.00s 30.00s
2020-05-22T11:59:33.270-0400 INFO main Bootstrap discovery.store-cache-ttl 1.00s 1.00s
2020-05-22T11:59:33.270-0400 INFO main Bootstrap dynamic.store.gc-interval 1.00h 1.00h
2020-05-22T11:59:33.271-0400 INFO main Bootstrap dynamic.store.remote.max-batch-size 1000 1000
...

The excerpt above shows typical bootstrap log lines.

There are three logs of interest:

  • server.log – This is the main log file used by Presto. If the server fails during initialization, it usually contains relevant information. It will automatically rotate and compress.
  • launcher.log – This log is created by the launcher and is connected to the server’s stdout and stderr streams. It will contain some log messages that occurred while initializing server logging and any error or diagnostic information generated by the JVM.
  • http-request.log – This is the HTTP request log, which contains every HTTP request received by the server. It will automatically rotate and compress.

The log files’ location depends on where you are running your PrestoDB cluster and how you have it configured:

  • If you are running on EC2 look here: /media/ephemeral0/presto/var/log/bootstrap.log 
  • Other places to check (not just for the server log) are: /var/log/ and /var/lib/presto/data/var/log/ and /var/presto/data/

Additional Bootstrap logs Support

If all else fails when looking for bootstrap logs, look in your config.properties file (look in /opt/presto-server/etc/) to see if there is anything deployment-specific configured for logging. 

Configuration tip: Presto needs a data directory for storing logs, etc. and it is recommended this is created in a data directory outside of the installation directory (which allows it to be easily preserved when upgrading Presto). 

  • Presto’s node properties file etc/node.properties sets the location of the data directory. 
  • In your node.properties file set the node.data-dir property to point to the location (filesystem path) of the intended data directory – Presto will store logs and other data here. 
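
For reference, a minimal illustrative etc/node.properties might look like this (the values are examples only – use your own environment name, a unique node id, and your chosen data directory):

```properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```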

Finally, if you are wondering where to start when diagnosing bootstrap issues, it is worth looking at the output in the bootstrap section of the server.log file first. 

Armed with the above information you should now be able to find the Presto Server Bootstrap logs.

Presto Metrics

Where can I find different Presto metrics for monitoring?

If you’re wondering “Where can I find different Presto metrics for monitoring?”, we’ll help explain it. There are several ways to monitor Presto metrics. Let’s look at some options.

1. Presto Metrics: Presto Console

Presto provides a web interface for monitoring and managing queries. The web interface is accessible on the Presto coordinator via HTTP, using the HTTP port number specified in the coordinator Config Properties (default is 8080). The console’s main page looks like this:

[Screenshot: Presto Console main page]

This main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. The currently running queries are at the top of the page, followed by the most recently completed or failed queries.  A query can have one of several states:

  • QUEUED – Query has been accepted and is awaiting execution.
  • PLANNING – Query is being planned.
  • STARTING – Query execution is being started.
  • RUNNING – Query has at least one running task.
  • BLOCKED – Query is blocked and is waiting for resources (buffer space, memory, splits, etc.).
  • FINISHING – Query is finishing (e.g. commit for autocommit queries).
  • FINISHED – Query has finished executing and all output has been consumed.
  • FAILED – Query execution failed.

The following console screenshot shows an example of an executed Presto query; here’s a breakdown of the stats:

  • The query was run by the root user, using the presto-cli tool. 
  • 54 splits were completed during execution.
  • It took 5.17 seconds (wall clock time) to run. 
  • The query consumed 19.7 secs of CPU time – this is greater than the wall clock time since there are multiple CPUs and multiple cores at work.  
  • In terms of (JVM) memory, the query used up to 59.2MB during execution.  
  • The “Cumulative User Memory” (35.5M) is the sum of all user memory consumption across all query stages, taken from queryStats. Its unit is memory × seconds, i.e. user memory consumption integrated over the query’s execution time, excluding the memory consumed by the system:
[Screenshot: Presto Console query list entry]

Click on the query ID link (20200925 in this example) and you will see a LOT more detail:

[Screenshots: query detail pages]

Notice the “Live Plan” tab at top-right which will give you a graphical representation of the query plan, which you read from the bottom up:

[Screenshot: Live Plan view]

The plan is clickable – click a stage and you can drill down into more detail to monitor Presto metrics.

2. JMX

Java Management Extensions (JMX) provides information about the Java Virtual Machine and all of the software running inside it. 

Presto exposes JMX through a connector, configured so that chosen JMX information is periodically dumped and stored in tables (in the “jmx” catalog) which can be queried. Note this is controlled in the catalog properties file: etc/catalog/jmx.properties

JMX is useful for debugging and monitoring Presto metrics. 

Here’s how to query it using presto-cli:

$ presto --schema jmx --catalog jmx
> select * from jmx.information_schema.tables;   -- lists tables in the information schema

> show tables from jmx.information_schema; 	    -- another way of listing tables

> select * from jmx.information_schema.views;    -- lists all views 

The most useful JMX schema for monitoring is “current” which contains every MBean from every node in the Presto cluster.  The following query uses the “current” schema to return information from the different Presto memory pools on each node:

> SELECT freebytes, node, object_name FROM jmx.current."com.facebook.presto.memory:*type=memorypool*";


 freebytes |     node     |                       object_name                        
-----------+--------------+----------------------------------------------------------
 322122547 | 4768d52a6258 | com.facebook.presto.memory:type=MemoryPool,name=reserved 
 429496730 | 4768d52a6258 | com.facebook.presto.memory:type=MemoryPool,name=general  
(2 rows)
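
Another handy query against the “current” schema returns JVM details for every node in the cluster (the java.lang:type=runtime MBean is standard in any JVM):

```sql
SELECT node, vmname, vmversion
FROM jmx.current."java.lang:type=runtime";
```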

More info on monitoring can be found in the docs – check out this blog for details on memory-related monitoring, management and memory pools: https://prestodb.io/blog/2019/08/19/memory-tracking 

3. REST API

You can make a simple REST call to Presto via the Presto REST API to get a dump of recently run queries using:

http://<prestoServerHost>:8080/v1/query

The response is in JSON format.

You can optionally specify a query ID – in this example my query ID is 20200926_204458_00000_68x9u:

http://<prestoServerHost>:8080/v1/query/20200926_204458_00000_68x9u

4. Third-Party Tools

You can also monitor Presto metrics using third-party tools like Datadog, Prometheus, etc.

The above is not an exhaustive list, but I hope it helps you find different Presto metrics for monitoring.




Where to Find Presto Logs

If you’re wondering “where do I find Presto logs”, we’ll help explain it. Presto needs a data directory for storing logs and other data, and it is recommended this is created outside of the installation directory (which allows it to be easily preserved when upgrading Presto). Presto’s node properties file etc/node.properties sets the location of the data directory. Set the node.data-dir property to point to the location (filesystem path) of the intended data directory – Presto will store logs and other data here.


For example, my environment is set to log to /var/lib/presto/data/var/log and in here we find http request and other logs: 

$ ls -al /var/lib/presto/data/var/log
-rw-r--r-- 1 root root 5202277 Sep 25 21:53 http-request.log
-rw-r--r-- 1 root root  238037 Sep 12 01:23 http-request.log-2020-09-11.0.log.gz
-rw-r--r-- 1 root root  636367 Sep 13 00:00 http-request.log-2020-09-12.0.log.gz
-rw-r--r-- 1 root root  583609 Sep 14 00:00 http-request.log-2020-09-13.0.log.gz
-rw-r--r-- 1 root root  297803 Sep 16 12:53 http-request.log-2020-09-14.0.log.gz
-rw-r--r-- 1 root root  225971 Sep 17 00:00 http-request.log-2020-09-16.0.log.gz
-rw-r--r-- 1 root root  221244 Sep 25 10:43 http-request.log-2020-09-17.0.log.gz
...

Note that logging levels are controlled by the etc/log.properties file. The level can be set to one of four levels: DEBUG, INFO, WARN and ERROR. 
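
For example, a minimal etc/log.properties that sets Presto’s own code to INFO level might look like this (the logger name is a Java package prefix):

```properties
com.facebook.presto=INFO
```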

Query Logs: Running/completed/failed queries are displayed in the Presto Console UI. You can also access query logs using the Presto Event Listener in combination with custom functions to listen to events happening inside the Presto engine and react to them. Event listeners are invoked for the following events in the Presto query workflow: 1) query creation, 2) query completion, and 3) split completion. There’s a worked example here http://dharmeshkakadia.github.io/presto-event-listener/ and the doc page is here https://prestodb.io/docs/current/develop/event-listener.html. This method enables you to collect all the queries submitted to Presto for later analysis.

Ahana & Presto Logs

If you are using Ahana Cloud then it exposes the query log in a catalog that can be easily queried, e.g. using presto-cli for convenience.  The catalog is called ahana_querylog and it uses the aforementioned Event Listener mechanism:

$ presto --server https://devtest.james.staging.ahn-dev.app 
presto:demo> show catalogs;
    Catalog     
----------------
 ahana_hive     
 ahana_querylog 
 glue           
 glue2          
 jmx            
 mysql-db       
 system         
 tpcds          
 tpch           
(9 rows)

presto:public> use ahana_querylog.public;
presto:public> show tables;

       Table       
-------------------
 querylog          
 schema_migrations 
(2 rows)

presto:public> select * from querylog;
      type      | cluster_name |           ts            | seq |    user     |                               
----------------+--------------+-------------------------+-----+-------------+-----------------
 queryCreated   | devtest      | 2020-09-22 13:03:32.000 |   1 | jamesmesney | show catalogs                 
 queryCreated   | devtest      | 2020-09-22 13:03:54.000 |   3 | jamesmesney | use ahana_querylog            
 queryCreated   | devtest      | 2020-09-22 13:04:15.000 |   4 | jamesmesney | use ahana_hive                
 queryCreated   | devtest      | 2020-09-22 13:06:28.000 |   5 | jamesmesney | SHOW FUNCTIONS                
 queryCreated   | devtest      | 2020-09-22 13:15:13.000 |   8 | jamesmesney | show catalogs                 
 queryCreated   | devtest      | 2020-09-22 13:15:19.000 |  10 | jamesmesney | use ahana_hive                
 queryCompleted | devtest      | 2020-09-22 13:15:19.000 |  11 | jamesmesney | use ahana_hive                
 queryCreated   | devtest      | 2020-09-22 13:15:20.000 |  13 | jamesmesney | SELECT table_name 
 queryCreated   | devtest      | 2020-09-22 13:15:24.000 |  15 | jamesmesney | show tables                   
 queryCreated   | devtest      | 2020-09-22 13:15:40.000 |  16 | jamesmesney | SHOW FUNCTIONS                
 queryCreated   | devtest      | 2020-09-22 13:15:44.000 |  21 | jamesmesney | show tables                   
 queryCompleted | devtest      | 2020-09-22 13:15:55.000 |  25 | jamesmesney | use ahana_querylog            
 queryCreated   | devtest      | 2020-09-22 13:15:20.000 |  12 | jamesmesney | SHOW FUNCTIONS                
 splitCompleted | devtest      | 2020-09-22 13:15:44.000 |  22 | NULL        | NULL                          
 queryCreated   | devtest      | 2020-09-22 13:15:55.000 |  24 | jamesmesney | use ahana_querylog     

So, now you know where and how to find Presto logs. Are you looking for more information on all-things Presto? If so, check out some of our past presentations.
