
AWS Redshift Query Limits

What is Amazon Redshift?

At its heart, Redshift is Amazon's petabyte-scale data warehouse product, originally based on PostgreSQL version 8.0.2. It has since evolved into a powerful distributed system that can return speedy results across millions of rows. Conceptually it is based on node clusters, with a leader node and compute nodes. The leader generates the execution plan for queries and distributes those tasks to the compute nodes. Scalability is achieved with elastic resizing that can quickly add or modify compute nodes as needed. We’ll discuss the details in the article below.

Limitations of Using Amazon Redshift

There are of course Redshift limitations on many parameters, which Amazon refers to as “quotas”. There is a Redshift query limit, a database limit, a Redshift query size limit, and many others. These have default values from Amazon and are per AWS region. Some of these quotas can be increased by submitting an Amazon Redshift Limit Increase Form. Below is a table of some of these quota limitations.

  • Nodes per cluster: 128 (adjustable)
  • Nodes per region: 200 (adjustable)
  • Schemas per DB per cluster: 9,900 (not adjustable)
  • Tables per node type: 9,900 – 100,000 (not adjustable)
  • Query limit: 50 (not adjustable)
  • Databases per cluster: 60 (not adjustable)
  • Stored procedures per DB: 10,000 (not adjustable)
  • Query size limit: 100,000 rows (adjustable)
  • Saved queries: 2,500 (adjustable)
  • Correlated subqueries: need to be rewritten (not adjustable)

AWS Redshift Performance

To start, Redshift stores data in a compressed, columnar format. This means there is less area on disk to scan and less data that has to be moved around. Add sort keys and zone maps in place of traditional indexes and you have the base recipe for high performance. In addition, Redshift maintains a result cache, so frequently executed queries are highly performant. This is aided by the query plan optimization done on the leader node. Redshift also partitions data across nodes in a highly efficient manner to complement the optimizations in its columnar storage.


There are a robust number of scaling strategies available from Redshift. With just a few clicks in the AWS Redshift console, or even with a single API call, you can change node types, add nodes and pause/resume the cluster. You are also able to use Elastic Resize to dynamically adjust your provisioned capacity within a few minutes. A Resize Scheduler is also available where you can schedule changes, say for month-end processing for example. There is also Concurrency Scaling that can automatically provision additional capacity for dynamic workloads.


A lot of variables go into Redshift pricing depending on the scale and features you choose. All of the details and a pricing calculator can be found on the Amazon Redshift Pricing page. To give you a quick overview, however, prices start as low as $0.25 per hour. Pricing is based on compute time and size and goes up to $13.04 per hour. Amazon provides some incentives to get you started and try out the service.

First, similar to the Ahana Cloud Community Edition, Redshift has a “Free Tier”: if your company has never created a Redshift cluster, you are eligible for a two-month trial of a DC2.Large node. This provides 750 hours per month for free, which is enough to run that DC2 node continuously, with 160GB of compressed SSD storage. Once your trial expires or your usage exceeds 750 hours per month, you can either keep the cluster running at their “on-demand” pricing or shut it down.

Next, there is a $500 credit available to try the Amazon Redshift Serverless option if you have never used it before. The credit applies to both compute and storage; how long it lasts depends entirely on the compute capacity you select and your usage.

Then there is “on-demand” pricing. This option allows you to pay for provisioned capacity by the hour with no commitments or upfront costs; partial hours are billed in one-second increments. Amazon allows you to pause and resume these nodes when you aren’t using them, so you don’t continue to pay for compute while preserving what you have; you’ll only pay for backup storage.
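
As a sketch of how that per-second billing adds up (the $0.25/hour figure is the advertised starting price mentioned above; actual rates vary by node type and region):

```python
# Hypothetical illustration of Redshift on-demand per-second billing.
# The $0.25/hour rate is the advertised starting price; actual rates
# depend on node type and region.

def on_demand_cost(hourly_rate, seconds_running):
    """Partial hours are billed in one-second increments."""
    return hourly_rate * seconds_running / 3600

# Running a single node for 90 minutes (5,400 seconds):
cost = on_demand_cost(0.25, 90 * 60)
print(f"${cost:.3f}")  # $0.375
```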


Redshift provides a robust, scalable environment that is well suited to managing data in a data warehouse. Amazon provides a variety of ways to easily give Redshift a try without getting too tied in. Not all analytic workloads make sense in a data warehouse, however, and if you are already landing data into AWS S3, then you have the makings of a data lakehouse that can offer better price/performance. A managed Presto service, such as Ahana, can be the answer to that challenge.


What is AWS Redshift Spectrum?


What is Redshift Spectrum? Since it shares a name with AWS Redshift, there is some confusion about what AWS Redshift Spectrum is. To discuss that, however, it’s important to know what AWS Redshift is: an Amazon data warehouse product based on PostgreSQL version 8.0.2.

Launched in 2017, Redshift Spectrum is a feature within Redshift that enables you to query data stored in AWS S3 using SQL. Spectrum allows you to do federated queries from within the Redshift SQL query editor to data in S3, while also being able to combine it with data in Redshift.

Benefits of Redshift Spectrum

When compared to a similar object-store SQL engine available from Amazon such as Athena, Spectrum has significantly higher and more consistent performance. Athena uses pooled resources, while Spectrum’s resources are based on your Redshift cluster size and are, therefore, a known quantity.

Spectrum allows you to access your data lake files from within your Redshift data warehouse without having to go through an ingestion process. This makes data management easier, while also reducing data latency since you aren’t waiting for ETL jobs to be written and processed.

With Spectrum, you continue to use SQL to connect to and read AWS S3 object stores in addition to Redshift, which means there are no new tools to learn and it allows you to leverage your existing skillsets. Under the hood, Spectrum is breaking the user queries into filtered subsets that run concurrently. These can be distributed across thousands of nodes to enhance the performance and can be scaled to query exabytes of data. The data is then sent back to your Redshift cluster for final processing.

Performance & Price

Redshift Spectrum is going to be as fast as the slowest data store in your aggregated query. If you are joining from Redshift to a terabyte-sized CSV file, the performance will be extremely slow. Connecting to a well-partitioned collection of column-based Parquet stores on the other hand will be much faster. Not having indexes on the object stores means that you really have to rely on the efficient organization of the files to get higher performance.

As to price, Spectrum follows the terabyte scan model that Amazon uses for a number of its products. You are billed per terabyte of data scanned, rounded up to the next megabyte, with a 10 MB minimum per query. For example, if you scan 10 GB of data, you will be charged $0.05. If you scan 1 TB of data, you will be charged $5.00. This does not include any fees for the Redshift cluster or the S3 storage.
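
The scan-based pricing above can be sketched as a small calculation (a minimal illustration of the stated $5/TB rate, megabyte rounding, and 10 MB minimum):

```python
import math

# Sketch of Redshift Spectrum's scan-based pricing: $5 per TB scanned,
# rounded up to the next megabyte, with a 10 MB minimum per query.
PRICE_PER_TB = 5.00
MIN_MB = 10

def spectrum_scan_cost(bytes_scanned):
    mb = max(MIN_MB, math.ceil(bytes_scanned / 2**20))
    return PRICE_PER_TB * mb / 2**20  # 1 TB = 2**20 MB

print(round(spectrum_scan_cost(10 * 2**30), 2))  # 10 GB scan -> 0.05
print(spectrum_scan_cost(2**40))                 # 1 TB scan  -> 5.0
```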

Redshift and Redshift Spectrum Use Case

An example of combining Redshift and Redshift Spectrum could be a high-velocity eCommerce site that sells apparel. Your historical order history is contained in your Redshift data warehouse, but real-time orders are coming in through a Kafka stream and landing in S3 in Parquet format. Your organization needs to make an order decision for particular items because there is a long lead time. Redshift knows what you have done historically, but that S3 data is only processed monthly into Redshift. With Spectrum, the query can combine what is in Redshift and join that with the Parquet files on S3 to get an up-to-the-minute view of order volume so a more informed decision can be made.
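
A hedged sketch of what such a combined query might look like, built here as a SQL string (all table and column names are invented for illustration; in practice the external table would be exposed through a Spectrum external schema):

```python
# Hypothetical federated query for the apparel example. "orders_history"
# is assumed to be a local Redshift table and "spectrum.orders_stream" an
# external table over the Parquet files in S3 -- both names are invented.
query = """
SELECT item_id, SUM(quantity) AS total_units
FROM (
    SELECT item_id, quantity FROM orders_history          -- historical orders in Redshift
    UNION ALL
    SELECT item_id, quantity FROM spectrum.orders_stream  -- fresh orders landing in S3 (Parquet)
) AS combined
GROUP BY item_id
ORDER BY total_units DESC;
"""
print(query)
```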


Amazon Redshift Spectrum provides a layer of functionality to Redshift that allows you to interact with object stores in AWS S3 without building a whole other tech stack. It makes sense for companies who are using Redshift and need to stay there but also need to make use of the data lake, or for companies that are considering leaving Redshift behind and going entirely to the data lake. Redshift Spectrum does not make sense if all your files are already in the data lake: Spectrum gets very expensive as data grows and offers little visibility into queries. This is where a managed service like Ahana for Presto fits in.


How to Build a Data Lake Using Lake Formation on AWS


Briefly, AWS Lake Formation helps users build, manage, and secure their data lakes in a very short amount of time, meaning days instead of the months common with a traditional data lake approach. AWS Lake Formation builds on and works with the capabilities found in AWS Glue.

How it Works

Your root user can’t be your administrator for your data lake, so the first thing you want to do is create a new user that has full admin rights. Go to IAM and create that user and give them AdministratorAccess capability. Next, create an S3 bucket and any data directories you are going to use if you don’t already have something configured. Do that in the S3 segment of AWS as you would normally. If you already have an S3 location setup, you can skip that step. In either case, we then need to register that data lake location in Lake Formation. The Lake Formation menu looks like this:

[Image: the AWS Lake Formation menu]

Now, with your data sources registered in Lake Formation, you can create a database from those sources and, from there, create your Glue crawlers. A crawler takes the database you created, goes into the S3 bucket, and reads the directory structure and files to create the tables and fields within the database. Once you’ve run your crawler, you’ll see the tables and fields reflected under “Tables”. The crawler creates a metadata catalog that describes the underlying data, which is then presented to other tools such as AWS QuickSight and Ahana Presto. Amazon provides this diagram:

[Image: AWS Lake Formation architecture diagram]

To summarize thus far, we’ve 

  • Created an admin user
  • Created an S3 bucket
  • Created three directories in the S3 bucket
  • Registered the S3 bucket as a data lake location


Having your data repositories registered and then created as a database in Lake Formation provides a number of advantages in terms of centralization of work. Fundamentally, the role of Lake Formation is to control access to data that you register. A combination of IAM roles and “Data lake permissions” is how you control this on a more macro level. Amazon shows the flow this way:

[Image: data lake permissions flow]

The major advantages, however, lie with “LF-Tags” and “LF-Tag permissions”. This is where granular security can be applied in a way that greatly simplifies your life. Leveraging Lake Formation, we have two ways to assign and manage permissions to our catalog resources: “named” based access and “tag” based access.

[Image: managing permissions in the catalog]

Named-based access is what most people are familiar with: you select the principal, which can be an AWS user or group of users, and assign it access to a specific database or table. The tag-based access control method uses Lake Formation tags, called “LF-Tags”. These are attributes assigned to data catalog resources (databases, tables, and even columns) and granted to principals in your AWS account to manage authorization to those resources. This is especially helpful in environments that are growing or changing rapidly, where policy management can be onerous. Tags are essentially key/value pairs with the following limits:

  • Tags can be up to 128 characters long
  • Values can be up to 256 characters long
  • Up to 15 values per tag
  • Up to 50 LF-Tags per resource

Use Cases

If we wanted to control access to an employee table, for example, such that HR could see everything, everyone in the company could see the names, titles, and departments of employees, and the outside world could only see job titles, we could set that up as:

  • Key = Employees
  • Values = HR, corp, public

Using this simplified view as an example:

[Image: Lake Formation use case flow example]

We have resources “employees” and “sales”, each with multiple tables and columns. In a conventional security model, you would give the HR group full access to the employees resource, while the corp group would only have access to the “details” table. What if you needed to give the corp group access to position.title and payroll.date? We would simply add the corp group LF-Tag to those fields in addition to the details table, and now the group can read those specific fields out of the other two tables, in addition to everything it can read in the details table. The corp group LF-Tag permissions would look like this:

  • employees.details
  • employees.position.title
  • employees.payroll.date
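
The grant logic above can be modeled with a toy sketch (this is not the Lake Formation API, just an illustration of how tag-based permissions compose; the resource paths and the extra "salary" column are invented for the example):

```python
# Toy model of LF-Tag-style access control. Resources carry tags; a
# principal can read a resource if it holds a grant matching one of the
# resource's tag values. All names here are hypothetical.
resource_tags = {
    "employees.details":        {"Employees": "corp"},
    "employees.position.title": {"Employees": "corp"},
    "employees.payroll.date":   {"Employees": "corp"},
    "employees.payroll.salary": {"Employees": "HR"},   # HR-only column
}
principal_grants = {
    "corp_group": {"Employees": {"corp"}},
    "hr_group":   {"Employees": {"corp", "HR"}},
}

def can_read(principal, resource):
    tags = resource_tags.get(resource, {})
    grants = principal_grants.get(principal, {})
    return any(value in grants.get(key, set()) for key, value in tags.items())

print(can_read("corp_group", "employees.payroll.date"))    # True
print(can_read("corp_group", "employees.payroll.salary"))  # False
```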

If we were to control by named resources, each person would have to be specifically granted access to those databases and tables, and often there is no ability to control by column, so that part wouldn’t even be possible at the data level.


AWS Lake Formation really simplifies the process whereby you set up and manage your data lake infrastructure. Where it really shines is in the granular security that can be applied through the use of LF Tags. An AWS Lake Formation tutorial that really gets into the nitty-gritty can be found online from AWS or any number of third parties on YouTube. The open-source data lake has many advantages over a data warehouse and Lake Formation can help establish best practices and simplify getting started dramatically.


Amazon Redshift Pricing: An Ultimate Guide

AWS Redshift is a fully managed cloud data warehouse service that can scale on demand and is compatible with a multitude of AWS tools and technologies. Redshift is the preferred cloud data warehouse for many customers, but its pricing is not simple, since it tries to accommodate different use cases and customers. Let us try to understand the pricing details of Amazon Redshift.

Understanding node types pricing

The Redshift cluster consists of multiple nodes, allowing it to process data faster. This means Redshift performance depends on the node type and the number of nodes. Node types can be dense compute nodes or RA3 nodes with Redshift managed storage.

Dense Compute nodes: These nodes offer physical memory up to 244GB and storage capacity on SSD up to 2.5TB.

RA3 with managed storage nodes: These nodes offer physical memory up to 384GB and storage capacity up to 128TB. Additionally, when storage runs out on the nodes, Redshift offloads data to S3. The RA3 pricing below does not include the cost of managed storage.

[Table: node type pricing]

Redshift Spectrum pricing

Redshift Spectrum is a serverless offering that allows running SQL queries directly against an AWS S3 data lake; it is priced per TB of data scanned.

[Image: Redshift Spectrum pricing example]

Concurrency scaling

Amazon Redshift allows you to grab additional resources as needed and release them when they are no longer needed. For every day of typical usage, up to one hour of concurrency scaling is free; every second beyond that is charged for the additional resource usage.

A pricing example, as stated by Amazon: a 10-node DC2.8XLarge Redshift cluster in US-East costs $48 per hour. Consider a scenario where two transient clusters are utilized for 5 minutes beyond the free concurrency scaling credits. The per-second on-demand rate is $48 × 1/3600 = $0.013 per second. The additional cost for concurrency scaling in this case is $0.013 per second × 300 seconds × 2 transient clusters = $8.
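
That arithmetic, worked through in code:

```python
# Amazon's concurrency scaling example: a 10-node DC2.8XLarge cluster at
# $48/hour, with two transient clusters running 5 minutes each beyond the
# free credits.
hourly_rate = 48.00
per_second_rate = hourly_rate / 3600   # ~$0.0133 per second
extra = per_second_rate * 300 * 2      # 300 seconds x 2 transient clusters
print(f"${extra:.2f}")  # $8.00
```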

Amazon Redshift managed storage (RMS) pricing

Managed storage comes with RA3 node types. Usage of managed storage is calculated hourly based on the total GB stored. Managed storage does not include backup storage charges due to automated and manual snapshots.

[Table: managed storage pricing]

Pricing example for managed storage pricing

100 GB stored for 15 days: 100 GB × 15 days × (24 hours/day) = 36,000 GB-hours

100 TB stored for 15 days: 100 TB × 1024 GB/TB × 15 days × (24 hours/day) = 36,864,000 GB-hours

Total usage in GB-hours: 36,000 GB-hours + 36,864,000 GB-hours = 36,900,000 GB-hours

Total usage in GB-months: 36,900,000 GB-hours / 720 hours per month = 51,250 GB-months

Total charges for the month: 51,250 GB-months × $0.024 = $1,230
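
The same managed storage example, step by step in code (using the $0.024 per GB-month rate from the example):

```python
# Managed storage billing example worked through step by step.
gb_hours = 100 * 15 * 24            # 100 GB stored for 15 days
gb_hours += 100 * 1024 * 15 * 24    # 100 TB stored for 15 days
gb_months = gb_hours / 720          # 720 hours per month
charge = gb_months * 0.024          # $0.024 per GB-month in this example
print(gb_hours, gb_months)          # 36900000 51250.0
print(f"${charge:,.2f}")            # $1,230.00
```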

Limitations of Redshift Pricing

As you can see, Redshift offers only a few instance types, each with limited storage.

Customers can easily hit the ceiling on node storage, and Redshift managed storage is expensive as data grows.

Redshift Spectrum (the serverless option), at $5 per TB scanned, can be an expensive choice, and it removes the customer’s ability to scale nodes up or down to meet performance requirements.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced use cases in Redshift, or to manage increasing cloud storage costs, check out Ahana – a powerful managed service for Presto which provides SQL on S3. Unlike Redshift Spectrum, Ahana allows customers to choose the right instance type and scale up or down as needed, and it comes with a simple pricing model based on the number of compute instances.

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.

Presto vs Snowflake: Data Warehousing Comparisons


Snowflake vs Presto

This article touches on several basic elements to compare Presto and Snowflake.

To start, let’s define what each of these is. Presto is an open-source SQL query engine for data lakehouse analytics. It’s well known for ad hoc analytics on your data. One important thing to note is that Presto is not a database: you can’t store data in Presto; instead, you use it as a compute engine for your data lakehouse. You can use Presto not just on the public cloud but also on private cloud infrastructure (on-premises or hosted).

Snowflake is a cloud data warehouse that offers a cloud-based data storage and analytics service. Snowflake runs completely on cloud infrastructure. Snowflake uses virtual compute instances for its compute needs and storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).

Use cases: Snowflake vs. Presto

Snowflake is a cloud solution for your traditional data warehouse workloads such as reporting and dashboards. It is good for small-scale workloads, i.e., moving traditional batch-based reporting and dashboard analytics to the cloud. I discuss this limitation in the Scalability and Concurrency section.

Presto is not only a solution for reporting & dashboarding. With its connectors and their in-place execution, platform teams can quickly provide access to datasets that analysts have an interest in. Presto can also run queries in seconds. You can aggregate terabytes of data across multiple data sources and run efficient ETL queries. With Presto, users can query data across many different data sources including databases, data lakes, and data lakehouses.

Open Source Or Vendor lock-in

Snowflake is not Open Source Software. Data that has been aggregated and moved into Snowflake is in a proprietary format only available to Snowflake users. Surrendering all your data to the Snowflake data cloud model is the ideal recipe for vendor lock-in. 

Vendor Lock-In can lead to:

  • Excessive cost as you grow your data warehouse
  • Data locked into the formats of a closed-source system, even when you want to use it in another system
  • No community innovations or ways to leverage other innovative technologies and services to process that same data

Presto is an Open Source project, under the Apache 2.0 license, hosted by the Linux Foundation. Presto benefits from community innovation. An open-source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Bytedance, Ahana, and many more. Dedicated Ahana engineers are working on the new PrestoDB C++ execution engine aiming to bring high-performance data analytics to the Presto ecosystem. 

Open File Formats

Snowflake has chosen to use a micro-partition file format that is good for performance but closed source. The Snowflake engine cannot work directly with common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into Snowflake’s internal file format, but users miss out on the performance optimizations these open formats can bring to an engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge ones.

On the other hand, Presto users can run ad-hoc, real-time analytics, with deep learning, on those same source files previously mentioned, without needing to copy files, so there’s more flexibility that users get with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration. 

Open transaction format

Many organizations are adopting Data Lakehouse architecture and augmenting their current data warehouse. This brings the need for a transaction manager layer that can be supported by Apache Hudi, Apache Iceberg, or Delta Lake. Snowflake does not support all of these table formats. Presto supports all these table formats natively, allowing users more flexibility and choice. With ACID transaction support from these table formats, Presto is the SQL engine for Open Data Lakehouse. Moreover, Snowflake data warehouse doesn’t support semi/unstructured data workloads, AI/ML/data science workloads, whereas the data lakehouse does. 

Data Ownership

While Snowflake did decouple storage and compute, they did not decouple data ownership. They still own the compute layer as well as the storage layer. This means users must ingest data into Snowflake in a proprietary format, creating yet another copy of the data and requiring users to move it out of their own environment. Users lose ownership of their data.

On the other hand, Presto is a truly disaggregated stack that allows you to run your queries in a federated manner without any need to move your data and create multiple copies. At Ahana, users can define Presto clusters, and orchestrate and manage them in their own AWS account using cross-account roles.

Scalability and Concurrency

With Snowflake, you hit a limit on the maximum number of concurrent users on a single virtual warehouse: if you have more than eight concurrent users, you need to spin up another virtual warehouse. Query performance is good for simple queries; however, performance degrades as you apply more complex joins on large datasets, and the only options available are limiting the data you query or adding more compute. Parallel writes also impact read operations, and the recommendation is to use separate virtual warehouses.

Presto is designed from the ground up for fast analytic queries against data sets of any size, has been proven on petabytes of data, and supports tens of concurrent queries at a time.

Cost of Snowflake

Users think of Snowflake as an easy and low-cost model. However, it gets very expensive and cost-prohibitive to ingest data into Snowflake. Very large amounts of data and enterprise-grade, long-running queries can result in significant costs associated with Snowflake as it requires the addition of more virtual data warehouses which can rapidly escalate costs. Basic performance improvement features like Materialized Views come with additional costs. As Snowflake is not fully decoupled, data is copied and stored into Snowflake’s managed cloud storage layer within Snowflake’s account. Hence, the users end up paying a higher cost to Snowflake than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price with a proprietary tag.

Open Source Presto is completely free. Users can run on-prem or in a cloud environment. Presto allows you to leave your data in the lowest cost storage options. You can create a portable query abstraction layer to future-proof your data architecture. Costs are for infrastructure, with no hidden cost for premium features. Data federation with Presto allows users to shrink the size of their data warehouse. By accessing the data where it is, users may cut the expenses of ETL development and maintenance associated with data transfer into a data warehouse. With Presto, you can also leverage storage savings by storing “cold” data in low-cost options like a data lake and “hot” data in a typical relational or non-relational database. 

Snowflake vs. Presto: In Summary

Snowflake is a well-known cloud data warehouse, but sometimes users need more than that – 

  1. Immediate data access as soon as it is written in a federated manner
  2. Eliminate lag associated with ETL migration when you can directly query from the source
  3. Flexible environment to run unstructured/ semi-structured or machine learning workloads
  4. Support for open file formats and storage standards to build open data lakehouse
  5. Open-source technologies to avoid vendor lock-in
  6. A cost-effective solution optimized for high concurrency and scalability

Presto can solve all these user needs in a more flexible, open-source, secure, scalable, and cost-effective way.

SaaS for Presto

If you want to use Presto, we’ve made it easy to get started in AWS. Ahana is a SaaS for Presto. With Ahana for Presto, you can run in containers on Amazon EKS, making the service highly scalable and available. We offer optimized Presto clusters that scale compute up and down as necessary, which helps companies control costs. With Ahana Cloud, you can easily integrate Presto with Apache Ranger or AWS Lake Formation to address your fine-grained access control needs. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and the data access and security policies you want to apply.


Redshift Data Warehouse Architecture Explained

Amazon Redshift is a cloud data warehouse offered as a managed service by AWS, and a popular choice for business intelligence and reporting use cases (see What Is Redshift Used For?).

You might already be familiar with Redshift basics – but in this article, we’ll dive a bit deeper to cover Redshift’s internal system design and how it fits into broader data lake and data warehouse architectures. Understanding these factors will help you reap the most benefits from your deployment while controlling your costs.

Redshift Architecture and Main Components

[Image: Redshift internal architecture]

As with other relational databases, storage and compute in Redshift are coupled. Data from applications, files, and cloud storage can be loaded into the data warehouse using either native AWS services such as Amazon Appflow or through a variety of 3rd party apps such as Fivetran and Matillion. Many of these tools would also provide ELT capabilities to further cleanse, transform, and aggregate data after it has been loaded into Redshift.

Zooming in on the internal architecture, we can see that a Redshift cluster is composed of a leader node and compute nodes; each compute node is divided into node slices, and the cluster can host multiple databases. This design allows Redshift to dynamically allocate resources in order to answer queries efficiently.

Breaking Down the Redshift Cluster Components

  • The leader node is Redshift’s ‘brain’ and manages communications with external client programs as well as the internal communication between compute nodes. When a query is made, the leader node will parse it, compile the code and create an execution plan.
  • Compute nodes provide the ‘muscle’ – the physical resources required to perform the requested database operation. This is also where the data is actually stored. Each compute node has dedicated CPU, RAM and storage, and these differ according to the node type.
  • The execution plan distributes the workload between compute nodes, which process the data in parallel. The workload is further distributed within the node: each node is partitioned into node slices, and each node slice is allocated a portion of the compute node’s memory and disk, according to the amount of data it needs to crunch.
  • Intermediate results are sent back to the leader node, which performs the final aggregation and sends the results to client applications via ODBC or JDBC. These would frequently be reporting and visualization tools such as Tableau or Amazon Quicksight, or internal software applications that read data from Redshift.
  • Redshift’s Internal Network provides high-speed communication between the nodes within the cluster.
  • Each Redshift cluster can contain multiple databases, with resources dynamically allocated between them.
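
The leader/slice flow described above can be sketched as a toy simulation (a deliberately simplified model, not Redshift's actual scheduler): the leader splits rows across slices, each slice computes a partial aggregate, and the leader combines the partials.

```python
# Toy sketch of the MPP flow: distribute rows across node slices, compute
# partial aggregates per slice (done in parallel in Redshift), then let
# the leader perform the final aggregation.
def leader_aggregate(rows, n_slices):
    slices = [rows[i::n_slices] for i in range(n_slices)]  # round-robin distribution
    partials = [sum(chunk) for chunk in slices]            # per-slice partial sums
    return sum(partials)                                   # leader's final aggregation

sales = [120, 75, 310, 42, 98, 11, 256]
print(leader_aggregate(sales, 4) == sum(sales))  # True: same answer, computed in parts
```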

This AWS presentation offers more details about Redshift’s internal architecture, along with a step-by-step breakdown of how queries are handled in Redshift and Redshift Spectrum.

Additional Performance Features

In addition to these core components, Redshift has multiple built-in features meant to improve performance:

  • Columnar storage: Redshift stores data in a column-oriented format rather than the row-based storage of traditional OLTP databases. This allows for more efficient compression and reduces the amount of data scanned per query.
  • Concurrency scaling: When a cluster receives a large number of requests, Redshift can automatically add resources to maintain consistent performance in read and write operations. 
  • Massively Parallel Processing (MPP): As described above, multiple compute nodes work on portions of the same query at the same time, ensuring final aggregations are returned faster.
  • Query optimizer: Redshift applies query optimizations that leverage its MPP capabilities and columnar data storage. This helps Redshift process complex SQL queries that could include multi-table joins and subqueries. 
  • Result caching: The results of certain types of queries can be stored in memory on the leader node, which can also reduce query execution time.

Redshift vs Traditional Data Warehouses

While Redshift can replace many of the functions filled by ‘traditional’ data warehouses such as Oracle and Teradata, there are a few key differences to keep in mind:

  • Managed infrastructure: Redshift infrastructure is fully managed by AWS rather than its end users – including hardware provisioning, software patching, setup, configuration, monitoring nodes and drives, and backups.
  • Optimized for analytics: While Redshift is a relational database management system (RDBMS) based on PostgreSQL and supports standard SQL, it is optimized for analytics and reporting rather than transactional features that require very fast retrieval or updates of specific records.
  • Serverless capabilities: Redshift Serverless can be used to automatically provision compute resources after a specific SQL query is made, further abstracting infrastructure management by removing the need to size your cluster in advance.

Redshift Costs and Performance

Amazon Redshift pricing can get complicated and depends on many factors, so a full breakdown is beyond the scope of this article. There are three basic types of pricing models for Redshift usage:

  • On-demand instances are charged by the hour, with no long-term commitment or upfront fees. 
  • Reserved instances offer a discount for customers who are willing to commit to using Redshift for a longer period of time. 
  • Serverless instances are charged based on usage, so customers only pay for the capacity they consume.

The size of your dataset and the level of performance you need from Redshift will often dictate your costs. Unlike object stores such as Amazon S3, scaling storage is non-trivial from a cost perspective (due to Redshift’s coupled architecture). When implementing use cases that require granular historical datasets you might find yourself paying for very large clusters. 

Performance depends on the number of nodes in the cluster and the type of node – you can pay for more resources to guarantee better performance. Other pertinent factors are the distribution of data, the sort order of data, and the structure of the query. 

Finally, you should bear in mind that Redshift compiles code the first time a query is run, meaning queries might run faster from the second time onwards – making it more cost-effective for situations where the queries are more predictable (such as a BI dashboard that updates every day) rather than exploratory ad-hoc analysis.

Reducing Redshift Costs with a Lakehouse Architecture

We’ve worked with many companies who started out using Redshift when they didn’t have much data but found it difficult and costly to scale as their needs evolved. 

Companies can face rapid growth in data when they acquire more users, introduce new business systems, or simply want to perform deeper exploratory analysis that requires more granular datasets and longer data retention periods. With Redshift’s coupling of storage and compute, this can cause their costs to scale almost linearly with the size of their data.

At this stage, it makes sense to consider moving from a data warehouse architecture to a data lakehouse to leverage inexpensive storage on Amazon S3 while distributing ETL and SQL query workloads between multiple services.

Redshift Lakehouse Architecture

In this architecture, companies can continue to use Redshift for workloads that require consistent performance such as dashboard reporting, while leveraging best-in-class frameworks such as open-source Presto to run queries directly against Amazon S3. This allows organizations to analyze much more data – without having to constantly up or downsize their Redshift clusters, manage complex retention policies, or deal with unmanageable costs.

To learn more about the considerations involved in evaluating data warehouses and data lakes, check out this white paper by Ventana Research: Unlocking the Value of the Data Lake.


What is Amazon Redshift Used For?


Amazon Redshift is one of the most widely-used services in the AWS ecosystem and is a familiar component in many cloud architectures. In this article, we’ll cover the key facts you need to know about this cloud data warehouse and the use cases it is best suited for, as well as limitations and scenarios where you might want to consider alternatives.

What is Amazon Redshift?

Amazon Redshift is a fully managed cloud data warehouse offered by AWS. First introduced in 2012, today Redshift is used by thousands of customers, typically for workloads ranging from hundreds of gigabytes to petabytes of data.

Redshift is based on PostgreSQL 8.0.2 and supports standard SQL for database operations. Under the hood, various optimizations are implemented to provide fast performance even at larger data scales, including massively parallel processing (MPP) and read-optimized columnar storage.

What is a Redshift Cluster?

A Redshift cluster represents a group of nodes provisioned as resources for a specific data warehouse. Each cluster consists of a leader and compute nodes. When a query is executed, Redshift’s MPP design means it distributes the processing power needed to return the results of an SQL query between the available nodes automatically.
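The scatter-gather idea behind MPP can be sketched in a few lines of Python (purely illustrative; real Redshift compute nodes are separate machines coordinating over a network, and the function names here are invented):

```python
# Illustrative scatter-gather aggregation: a "leader" splits a query's work
# across "compute nodes", each returns a partial result, and the leader
# combines the partials into the final answer.

def compute_node(rows):
    """Each compute node aggregates only its own slice of the data."""
    return sum(rows)

def leader(rows, num_nodes=4):
    """The leader partitions the data, fans out work, and merges partials."""
    slices = [rows[i::num_nodes] for i in range(num_nodes)]
    partials = [compute_node(s) for s in slices]
    return sum(partials)

# Conceptually: SELECT SUM(x) over the values 1..100
print(leader(list(range(1, 101))))  # 5050
```

The answer is the same regardless of how many nodes the work is spread across; only the elapsed time changes, which is the point of MPP.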

Determining cluster size depends on the amount of data stored in your database, the number of queries being executed, and the desired performance. 

Scaling and managing clusters can be done through the Redshift console, the AWS CLI, or programmatically through the Redshift Query API.

What Makes Redshift Unique?

When Redshift was first launched, it represented a true paradigm shift from traditional data warehouses provided by the likes of Oracle and Teradata. As a fully managed service, Redshift allowed development teams to shift their focus away from infrastructure and toward core application development. The ability to add compute resources automatically with just a few clicks or lines of code, rather than having to set up and configure hardware, was revolutionary and allowed for much faster application development cycles.

Today, many modern cloud data warehouses offer similar linear scaling and infrastructure-as-a-service functionality – with a few notable products including Snowflake and Google BigQuery. However, Redshift remains a very popular choice and is tightly integrated with other services in the AWS cloud ecosystem.

Amazon continues to improve Redshift, and in recent years has introduced federated query capabilities, a serverless option, and AQUA (a hardware-accelerated cache).

Redshift Use Cases

Redshift’s Postgres roots mean it is optimized for online analytical processing (OLAP) and business intelligence (BI) – typically executing complex SQL queries on large volumes of data rather than transactional processing which focuses on efficiently retrieving and manipulating a single row.

Some common use cases for Redshift include:

  • Enterprise data warehouse: Even smaller organizations often work with data from multiple sources such as advertising, CRM, and customer support. Redshift can be used as a centralized repository that stores data from different sources in a unified schema and structure to create a single source of truth, which can then feed enterprise-wide reporting and analytics.
  • BI and analytics: Redshift’s fast query execution against terabyte-scale data makes it an excellent choice for business intelligence use cases, and it is often used as the underlying database for BI tools such as Tableau (which otherwise might struggle to perform when querying or joining larger datasets).
  • Embedded analytics and analytics as a service: Some organizations might choose to monetize the data they collect by exposing it to customers. Redshift’s data sharing, search and aggregation capabilities make it viable for these scenarios, as it allows exposing only relevant subsets of data per customer while ensuring other databases, tables, or rows remain secure and private.
  • Production workloads: Redshift’s performance is consistent and predictable, as long as the cluster is adequately-resourced. This makes it a popular choice for data-driven applications, which might use data for reporting or perform calculations on it.
  • Change data capture and database migration: AWS Database Migration Service (DMS) can be used to replicate changes in an operational data store into Amazon Redshift. This is typically done to provide more flexible analytical capabilities, or when migrating from legacy data warehouses.

Redshift Challenges and Limitations 

While Amazon Redshift is a powerful and versatile data warehouse, it still suffers from the limitations of any relational database, including:

  • Costs: Since storage and compute are coupled, Redshift costs can quickly grow very high when working with larger datasets, or with streaming sources such as application logs.
  • Complex data ingestion: Unlike Amazon S3, Redshift does not support unstructured object storage. Data needs to be stored in tables with predefined schemas, which can often require complex ETL or ELT processes to be performed when data is written to Redshift. 
  • Access to historical data: Due to the above limiting factors, most organizations choose to store only a subset of raw data in Redshift, or limit the number of historical versions of the data that they retain. 
  • Vendor lock-in: Migrating data between relational databases is always a challenge due to the rigid schema and file formats used by each vendor. This can create significant vendor lock-in and make it difficult to use other tools to analyze or access data.

Due to these limitations, Redshift is often a less than ideal solution for use cases that require diverse access to very large volumes of data, such as exploratory data science and machine learning. In these cases, many organizations would gravitate towards storing the data on Amazon S3 in a data lakehouse architecture.

If your organization is struggling to accommodate advanced Redshift use cases, or managing increasing cloud storage costs, check out Ahana – a powerful managed service for Presto which provides SQL on S3. 

Want to learn from a real-life example? See how Blinkit cut their data delivery time from 24 hours to 10 minutes by moving from Redshift to Ahana – watch the case study here.


ETL vs ELT in Data Warehousing


ETL, or Extract Transform Load, is a process in which an ETL tool or a series of homegrown programs extracts data from one or more data sources, often relational databases, and applies transformation functions. Those transformations could be data cleansing, standardization, enrichment, calculations, concatenations, etc. The transformed data is then written (loaded) into a new repository, often a data warehouse.

ETL process diagram

ELT, or Extract Load Transform, turns the ETL process around: you extract the raw data from the data source and load it directly into the destination, without any processing in between. The transformation is then done “in place” in the destination repository. Generally, the raw data is stored indefinitely, so various transformations and enrichments can be done later by any users with access to it, using tools they are familiar with.


Both are data integration styles and have much in common with their ultimate goals, but are implemented very differently.
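The ordering difference can be sketched with a few toy Python functions (the “warehouse” and “lake” are just in-memory lists, and all names and data here are invented for illustration):

```python
# Toy contrast between ETL and ELT ordering.

def extract():
    # Pull raw records from a source system (untyped strings).
    return [{"name": "ann", "amount": "42"}, {"name": "bo", "amount": "7"}]

def transform(rows):
    # Cleanse and type the data.
    return [{"name": r["name"].title(), "amount": int(r["amount"])} for r in rows]

# ETL: transformation happens BEFORE the data reaches the destination.
warehouse = transform(extract())

# ELT: raw data is loaded first and kept; transformation runs later, in place.
lake = extract()        # raw records stored indefinitely
view = transform(lake)  # transformed on demand, raw data untouched

print(warehouse == view)  # True - same result, different ordering
```

Either way the analysts see the same cleansed rows; what differs is that ELT retains the raw records, so other transformations can be run against them later.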

The Difference Between ETL and ELT

So how does ETL vs ELT break down?

| | ETL | ELT |
|---|---|---|
| Definition | Data is extracted from any number of data sources, transformed in a separate process, then loaded into the destination repository. | Data is extracted from any number of data sources and loaded directly into the destination repository. Transformation occurs inside the destination. |
| Transformation | Data is transformed within an intermediate processing step that is independent of extract and load. | Data can be transformed on an ad-hoc basis during reads, or in batch and stored in another set of tables. |
| Code-based transformations | Primarily executed in the compute-intensive transformation process. | Primarily executed in the database, but also done ad-hoc through analysis tools. |
| Data lake support | Only in the sense that the lake can be utilized as storage for the transformation step. | Well suited to the data lake. |
| Cost | Specialized servers for transformation can add significant costs. | Object stores are very inexpensive and require no specialized servers. |
| Maintenance | Additional servers add to the overall maintenance burden. | Fewer systems mean less to maintain. |
| Loading | Data has to be transformed prior to loading. | Data is loaded directly into the destination system. |
| Maturity | ETL tools and methods have been around for decades and are well understood. | Relatively new on the scene, with emerging standards and less accumulated experience. |

Use Cases

Let’s take HIPAA as an example of data that lends itself to ETL rather than ELT. Raw HIPAA data contains a lot of sensitive information about patients that isn’t allowed to be shared, so you would need to run the transformation process prior to loading in order to remove that sensitive information. Say your analysts are trying to track cancer treatments for different types of cancer across a geographic region. You would scrub the data down in the transformation step to include treatment dates, location, cancer type, age, gender, etc., but remove any information that identifies the patient.
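A minimal sketch of such a scrub step might look like the following (all field names and values are invented for illustration, not a real HIPAA schema):

```python
# Hypothetical scrub transform: keep only the analytical fields,
# drop anything that could identify the patient.

ALLOWED = {"treatment_date", "location", "cancer_type", "age", "gender"}

def scrub(record):
    """Return a copy of the record containing only non-identifying fields."""
    return {k: v for k, v in record.items() if k in ALLOWED}

raw = {
    "patient_name": "Jane Doe",       # identifying - must be removed
    "ssn": "000-00-0000",             # identifying - must be removed
    "treatment_date": "2022-03-01",
    "location": "Ohio",
    "cancer_type": "melanoma",
    "age": 61,
    "gender": "F",
}
clean = scrub(raw)
print(sorted(clean))  # only the allowed analytical fields remain
```

In a real ETL pipeline this step would run on the transformation servers before anything is loaded into the warehouse, so sensitive fields never reach the destination at all.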

An ELT approach makes more sense with a data lake, where you have lots of structured, semi-structured, and unstructured data. This can also include high-velocity data where you are trying to make decisions in near real time. Consider an MMORPG where you want to offer incentives to players in a particular region who have performed a particular task. That data is probably coming in through a streaming platform such as Kafka, and analysts are running transformation jobs on the fly to distill it down to the information needed to fuel the desired action.


In summary, the difference between ETL and ELT in data warehousing really comes down to how you are going to use the data as illustrated above. They satisfy very different use cases and require thoughtful planning and a good understanding of your environment and goals. If you’re exploring whether to use a data warehouse or a data lake, we have some resources that might be helpful. Check out our white paper on Unlocking the Business Value of the Data Lake which discusses the data lake approach in comparison to the data warehouse. 

Understanding AWS Athena Costs with Examples

What Is Amazon Athena? 

Since you’re reading this to understand Athena costs, you likely already know what it is, so we’ll only touch on it briefly. Amazon Athena is a managed, serverless version of Presto that provides a SQL query engine for analyzing unstructured data in AWS S3. Because there are no dedicated resources behind the service, it will not perform in a consistent fashion, so it is best suited to cases where reliable speed and scalability are not particularly important: testing ideas, small use cases, and quick ad-hoc analysis.

How Much Does AWS Athena Cost?

An Athena query costs $5 to $7 per terabyte scanned, depending on the region. Most materials only quote the $5 figure, but some regions cost $7, so keep that in mind. For our examples, we’ll use $5 per terabyte as our base rate. There is no charge for failed queries, but any other charges, such as S3 storage, apply as usual for any service you are using.

AWS Athena Pricing Example

In this example, based on the Amazon Athena pricing calculator, we assume 1 query per work day per month, so 20 queries a month, each scanning 4 TB of data. The cost per query works out as follows:

$5 per TB scanned * 4 TB scanned = $20 per query

So if we are doing that query 20 times per month, then we have 20 * $20 = $400 per month 


You can mitigate these costs by storing your data compressed, if that is an option for you. A very conservative 2:1 compression ratio would cut your costs in half, to just $200 per month. If you were to also store your data in a columnar format like ORC or Parquet, you could reduce costs further by scanning only the columns you need instead of the entire row every time. Using the same 50% assumption, we now only have to read half the remaining data, bringing the cost down to $100 per month.

Let’s try a larger example, and not even an outlandish one if you are using a data lake and doing serious processing. Say you run 20 queries per day against 100 TB of uncompressed, row-based data:


That’s right: $304,000 per month. Twenty queries per day isn’t even unrealistic if a few departments run dashboard queries to refresh various metrics.
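The arithmetic in both examples can be reproduced with a small helper (the $5/TB rate matches our base assumption; the larger example uses about 30.4 days per month, which reproduces the $304,000 figure):

```python
def athena_monthly_cost(tb_scanned_per_query, queries_per_month,
                        price_per_tb=5.0):
    """Estimated Athena spend: price per TB, times TB scanned, times queries."""
    return price_per_tb * tb_scanned_per_query * queries_per_month

# First example: 20 queries/month, each scanning 4 TB
print(athena_monthly_cost(4, 20))           # 400.0

# Larger example: 20 queries/day on 100 TB, ~30.4 days/month
print(athena_monthly_cost(100, 20 * 30.4))  # 304000.0
```

Plugging in your own scan volumes (after compression and columnar pruning) is the quickest way to sanity-check a bill before it arrives.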


While we learned the details of Athena pricing, we also saw how easy it would be to get hit with an unexpectedly large bill. If you haven’t compressed or reformatted your data to reduce those costs and have simply dumped a bunch of CSV or JSON files into S3, you can be in for a nasty surprise. The same is true if you open Athena up to your data consumers without any controls and they fire off a lot of queries on a lot of data. Fortunately, it’s not hard to estimate the cost of a specific usage pattern, and Amazon has provided the tools to do it.

If you’re an Athena user who’s not happy with costs, you’re not alone. We see many Athena users wanting more control over their deployment and in turn, costs. That’s where we can help – Ahana is SaaS for Presto (the same technology that Athena is running) that gives you more control over your deployment. Typically our customers see up to 5.5X price performance improvements on their queries as compared to Athena. 

You can learn more about how Ahana compares to AWS Athena in this comparison page.


5 Components of Data Warehouse Architecture

The Data Warehouse has been around for decades. Born in the 1980s, it addressed the need for optimized analytics on data. As companies’ business applications began to grow and generate/store more data, they needed a system that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse (this process is also known as ETL – Extract, Transform, Load). Schema is made up of metadata (data about the data) so users could easily find what they were looking for. The data warehouse could also connect to many different data sources, so it became an easier way to manage all of a company’s data for analysis.

As the data warehouse grew in popularity, more people within a company started using it to access data – and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

Here’s a quick high level architecture of the data warehouse:

Architecture of Data Warehouse
In this article we’ll look at the five key components of a data warehouse architecture.

Those include:
  • ETL
  • Metadata
  • SQL Query Processing
  • Data layer
  • Governance/security


ETL

As mentioned above, ETL stands for Extract, Transform, Load. When DBAs want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed, queried, and so on. For the purposes of this article, I won’t go into too much detail of how the entire ETL process works; there are many resources where you can learn more about ETL.


Metadata

Metadata is data about data. Basically, it describes all of the data that’s stored in a system to make it searchable. Some examples of metadata include the author, date, or location of an article, the create date of a file, the size of a file, etc. Think of it like the titles of the columns in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.
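As a toy illustration, metadata can be modeled as small records describing each stored file, which is what makes the underlying data searchable (all field names and values here are invented for the sketch):

```python
# Each catalog entry describes a stored object - not its contents.
catalog = [
    {"file": "sales_2021.csv", "author": "ops", "size_kb": 8421, "created": "2021-12-31"},
    {"file": "sales_2022.csv", "author": "ops", "size_kb": 9130, "created": "2022-12-31"},
    {"file": "hr_notes.txt",   "author": "hr",  "size_kb": 12,   "created": "2022-06-01"},
]

def search(catalog, **filters):
    """Return entries whose metadata matches every filter."""
    return [e for e in catalog if all(e.get(k) == v for k, v in filters.items())]

print([e["file"] for e in search(catalog, author="ops")])
```

A real metadata catalog (like the Hive metastore or AWS Glue Catalog) works on the same principle, just at far greater scale and with schema information attached.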

SQL Query Processing

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.

Data Layer

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your Sales team access to your HR team’s data, and vice versa.


Governance/Security

This is related to the data layer in that you need to be able to provide fine-grained access and security policies across all of your organization’s data. Typically data warehouses have very good governance and security capabilities built in, so you don’t need to do a lot of custom engineering work to include them. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.

The Challenges with a Data Warehouse

Now that I’ve laid out the five key components of a data warehouse, let’s discuss some of its challenges. As companies house more data and need more advanced analytics across a wider range of data types, the data warehouse becomes expensive and inflexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work.

We’re seeing more companies moving to the Data Lakehouse architecture, which helps to address the above. The Data Lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. Instead of a tightly coupled system, the Data Lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data, and more. Here’s what that architecture looks like:

EDW is the open data lakehouse diagram

The Data Lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads.
If you’re interested in learning more about why companies are moving from the data warehouse to the data lakehouse, check out this free whitepaper on how to Unlock the Business Value of the Data Lake/Data Lakehouse.


Enterprise Data Lake Formation & Architecture on AWS

What is an Enterprise Data Lake?

An enterprise data lake is simply a data lake used for enterprise-wide sharing and storing of data. Its key purpose is to enable analytics that unlock business insights from the stored data.

Why AWS Lake Formation for an Enterprise Data Lake?

The key purpose of an enterprise data lake is to run analytics to gain business insights. As part of that process, data governance becomes important in order to secure access to data across the different roles in the enterprise. AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for data lakes on Amazon S3.

Enterprise Data Lake Formation & Architecture on AWS

Enterprise data platforms need a simple, scalable, and centralized way to define and enforce access policies on their data lakes: a policy-based approach that allows data lake consumers to use the analytics service of their choice, best suited to the operations they want to perform on the data. Using Amazon S3 bucket policies to manage access control is an option, but as the number of combinations of access levels and users grows, it quickly becomes unmanageable for enterprises.

AWS Lake Formation allows enterprises to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). 

Benefits of AWS Lake Formation for Enterprise Data Lakes

  • One shareable schema, with no dependency on architecture
  • Share AWS Lake Formation databases and tables with any AWS account
  • No Amazon S3 policy edits required
  • Recipients of the data can use an analytics service provider such as Ahana to run analytics
  • No dependency between roles on how the data will be further shared
  • Centralized logging

AWS Enterprise Lake Formation: To Summarize

AWS Lake Formation is integrated with AWS partners like Ahana Cloud, a managed service for SQL on data lakes. These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for data lakes.


How to Query Your JSON Data Using Amazon Athena

AWS Athena is Amazon’s serverless implementation of Presto, which means they generally share the same features. A popular use case is using Athena to query Parquet, ORC, CSV and JSON files, either directly or after they have been transformed and loaded into a data warehouse. Athena allows you to extract data from JSON, search for values, and parse JSON structures.

Using Athena to Query Nested JSON

To have Athena query nested JSON, we just need to follow some basic steps. In this example, we will use a “key=value” to query a nested value in a JSON. Consider the following AWS Athena JSON example:

[
  {
    "name": "Sam",
    "age": 45,
    "cars": {
      "car1": { "make": "Honda" },
      "car2": { "make": "Toyota" },
      "car3": { "make": "Kia" }
    }
  },
  {
    "name": "Sally",
    "age": 21,
    "cars": {
      "car1": { "make": "Ford" },
      "car2": { "make": "SAAB" },
      "car3": { "make": "Kia" }
    }
  },
  {
    "name": "Bill",
    "age": 68,
    "cars": {
      "car1": { "make": "Honda" },
      "car2": { "make": "Porsche" },
      "car3": { "make": "Kia" }
    }
  }
]

We want to retrieve all “name”, “age” and “car2” values out of the array:

SELECT name, age, cars.car2.make FROM the_table;

name  | age | cars.car2.make
Sam   | 45  | Toyota
Sally | 21  | SAAB
Bill  | 68  | Porsche

That is a pretty simple use case of retrieving certain fields out of the JSON. The complexity was the cars column with its key/value pairs, where we needed to identify which field we wanted. Nested values in a JSON can be represented as “key=value”, “array of values” or “array of key=value” expressions. We’ll illustrate the latter two next.

How to Query a JSON Array with Athena

Abbreviating our previous example to illustrate how to query an array, we’ll use a car dealership and car models, such as:

{
  "dealership": "Family Honda",
  "models": [ "Civic", "Accord", "Odyssey", "Brio", "Pilot" ]
}

We have to unnest the array and connect it to the original table:

SELECT dealership, cars FROM dataset
CROSS JOIN UNNEST(models) AS t(cars)

dealership   | cars
Family Honda | Civic
Family Honda | Accord
Family Honda | Odyssey
Family Honda | Brio
Family Honda | Pilot
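The flattening that CROSS JOIN UNNEST performs can be mimicked in plain Python, which helps build intuition for the SQL (toy data mirroring the example above):

```python
# Equivalent of:
#   SELECT dealership, cars FROM dataset
#   CROSS JOIN UNNEST(models) AS t(cars)
dataset = [{"dealership": "Family Honda",
            "models": ["Civic", "Accord", "Odyssey", "Brio", "Pilot"]}]

# Pair the parent row's scalar column with each element of the nested array.
rows = [(d["dealership"], model) for d in dataset for model in d["models"]]
for dealership, car in rows:
    print(dealership, car)
```

Each element of the array becomes its own output row, joined back to the scalar columns of the row that contained it, exactly as UNNEST does.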

Finally we will show how to query nested JSON with an array of key values.

Query Nested JSON with an Array of Key Values

Continuing with the car metaphor, we’ll consider a dealership and the employees in an array:

dealership:= Family Honda

employee:= [{name=Allan, dept=service, age=45},{name=Bill, dept=sales, age=52},{name=Karen, dept=finance, age=32},{name=Terry, dept=admin, age=27}]

To query that data, we have to first unnest the array and then select the column we are interested in. Similar to the previous example, we will cross join the unnested column and then unnest it:

select dealership, employee_unnested from dataset
cross join unnest(dataset.employee) as t(employee_unnested)

Family Honda | {name=Allan, dept=service, age=45}
Family Honda | {name=Bill, dept=sales, age=52}
Family Honda | {name=Karen, dept=finance, age=32}
Family Honda | {name=Terry, dept=admin, age=27}

By using “.key” notation, we can now retrieve specific columns:

select dealership, employee_unnested.name, employee_unnested.dept, employee_unnested.age from dataset
cross join unnest(dataset.employee) as t(employee_unnested)

Family Honda | Allan | service | 45
Family Honda | Bill  | sales   | 52
Family Honda | Karen | finance | 32
Family Honda | Terry | admin   | 27

Using these building blocks, you can start to test on your own JSON files using Athena to see what is possible. Athena, however, runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous Athena users who saw challenges around price performance and concurrency/deployment control. Keep in mind that Athena costs $5 to $7 per terabyte scanned, depending on the region. Ahana is priced purely on instance hours, and provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources.

You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/


AWS Athena vs AWS Glue: What Are The Differences?

Amazon’s AWS platform has over 200 products and services, which can make understanding what each one does and how they relate confusing. Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?

What is AWS Athena?

AWS Athena is a serverless implementation of Presto. It is an interactive query service that allows you to query structured or unstructured data straight out of S3 buckets.

What is AWS Glue?

AWS Glue is also serverless, but more of an ecosystem of tools to allow you to easily do schema discovery and ETL with auto-generated scripts that can be modified either visually or via editing the script. The most commonly known components of Glue are Glue Metastore and Glue ETL. Glue Metastore is a serverless hive compatible metastore which can be used in lieu of your own managed Hive. Glue ETL on the other hand is a Spark service which allows customers to run Spark jobs without worrying about the configuration, manageability and operationalization of the underlying Spark infrastructure. There are other services such as Glue Data Wrangler which we will keep outside the scope of this discussion.

AWS Athena vs AWS Glue

Where this turns from AWS Glue vs AWS Athena to AWS Glue working with Athena is with the Glue Catalog. The Glue Catalog is used as a central Hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services – Glue ETL, Athena, EMR, Lake Formation, AI/ML, etc. A key difference between Glue and Athena is that Athena is primarily a query tool for analytics, while Glue is more of a transformation and data movement tool.

Some examples of how Glue and Athena can work together would be:

  • Creating tables for Glue to use in ETL jobs. The table must have a property added to it called a classification, which identifies the format of the data. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena (the S3 location is a placeholder):

    CREATE EXTERNAL TABLE my_table (
      col1 INT,
      col2 INT,
      str1 STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/my-table/'
    TBLPROPERTIES ('classification'='csv');

  • Transforming data into a format that is better optimized for query performance in Athena, which will also reduce cost. For example, converting a CSV or JSON file into Parquet.

Query S3 Using Athena & Glue

Now how about querying S3 data utilizing both Athena and Glue? There are a few steps to set it up. First, we’ll assume a simple CSV file with IoT data in it, including a column named att2.

We would first upload our data to an S3 bucket, and then initiate a Glue crawler job to infer the schema and make it available in the Glue catalog. We can now use Athena to perform SQL queries on this data. Let’s say we want to retrieve all rows where ‘att2’ is ‘Z’; the query looks like this:

SELECT * FROM my_table WHERE att2 = 'Z';

From here, you can perform any query you want. You can even use Glue to transform the source CSV file into a Parquet file and use the same SQL statement to read the data. As a data analyst using Athena, you are insulated from the details of the backend, while data engineers can optimize the source data for speed and cost using Glue.

AWS Athena is a great place to start if you are just getting started on the cloud and want to test the waters at low cost and with minimal effort. Athena, however, quickly runs into challenges with regard to limits, concurrency, transparency and consistent performance. You can find more details here. Costs will increase significantly as the scanned data volume grows.

At Ahana, many of our customers are previous Athena users that saw challenges around price performance and concurrency/deployment control. Ahana is also tightly integrated with the Glue metastore, making it simple to map and query your data. Keep in mind that Athena costs $5 per terabyte scanned, while Ahana is priced purely on instance hours. Ahana provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 

You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/ 


Querying Parquet Files Using Amazon Athena

Parquet is one of the latest file formats with many advantages over some of the more commonly used formats like CSV and JSON. Specifically, Parquet’s speed and efficiency of storing large volumes of data in a columnar format are big advantages that have made it more widely used. It supports many optimizations and stores metadata around its internal contents to support fast lookups and searches by modern distributed querying/compute engines like PrestoDB, Spark, Drill, etc. Here are the steps to quickly get set up to query your Parquet files with a service like Amazon Athena.


Prerequisites:

  • Sample Parquet Data –  https://ahana-labs.s3.amazonaws.com/movielens/ratings/ratings.parquet
  • AWS Account and Role with access to below services:
    • AWS S3
    • AWS Glue (Optional but highly recommended)
    • AWS Athena

Setting up the Storage

For this example we will be querying the parquet files from AWS S3. To do this, we must first upload the sample data to an S3 bucket. 

Log in to your AWS account and select the S3 service in the Amazon Console.

  1. Click on Create Bucket
  2. Choose a name that is unique. For this example I chose ‘athena-parquet-<your-initials>’. S3 is a global service so try to include a unique identifier so that you don’t choose a bucket that has already been created. 
  3. Scroll to the bottom and click Create Bucket
  4. Click on your newly created bucket
  5. Create a folder in the S3 bucket called ‘test-data’
  6. Click on the newly created folder
  7. Choose Upload Data and upload your parquet file(s).
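The console steps above can also be scripted. This is a hedged boto3 sketch; the bucket name and file name are placeholders, and it is not invoked here:

```python
def object_key(folder, filename):
    """Join a folder prefix and a file name into an S3 object key."""
    return f"{folder.strip('/')}/{filename}"

def upload_sample(bucket, local_file):
    """Create the bucket and upload the Parquet file into 'test-data/'.

    Requires boto3 and configured AWS credentials.
    """
    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=bucket)  # outside us-east-1, also pass CreateBucketConfiguration
    s3.upload_file(local_file, bucket, object_key("test-data", local_file))

# Example (the bucket name must be globally unique):
# upload_sample("athena-parquet-abc", "ratings.parquet")
```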

Running a Glue Crawler

Now that the data is in S3, we need to define the metadata for the file. This can be tedious and involve using a different reader program to read the parquet file to understand the various column field names and types. Thankfully, AWS Glue provides a service that can scan the file and fill in the requisite metadata auto-magically. To do this, first navigate to the AWS Glue service in the AWS Console.

  1. On the AWS Glue main page, select ‘Crawlers’ from the left hand side column
  2. Click Add Crawler
  3. Pick a name for the crawler. For this demo I chose to use ‘athena-parquet-crawler’. Then choose Next.
  4. In Crawler Source Type, leave the settings as is (‘Data Stores’ and ‘Crawl all folders’) and choose Next.
  5. In Data Store under Include Path, type in the URL of your S3 bucket. It should be something like ‘s3://athena-parquet-<your-initials>/test-data/’.
  6. In IAM Role, choose Create an IAM Role and fill the suffix with something like ‘athena-parquet’. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket.
  7. For Frequency leave the setting as default and choose Next
  8. For Output, choose Add Database and create a database with the name ‘athena-parquet’. Then choose Next.
  9. Review and then choose Finish.
  10. AWS will prompt you if you would like to run the crawler. Choose Run it now or manually run the crawler by refreshing the page and selecting the crawler and choosing the action Run.
  11. Wait for the crawler to finish running. You should see the number 1 in the column Tables Added for the crawler.
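The console workflow above maps onto two Glue API calls: create the crawler, then start it. A minimal boto3 sketch, where the crawler name, IAM role ARN, database, and S3 path are placeholders:

```python
def crawler_config(name, role_arn, database, s3_path):
    """Build the arguments for Glue's CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(config):
    """Register the crawler and start a run.

    Requires boto3 and configured AWS credentials; not invoked here.
    """
    import boto3  # AWS SDK for Python

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])

# Example (the role ARN is a placeholder):
# create_and_run_crawler(crawler_config(
#     "athena-parquet-crawler",
#     "arn:aws:iam::123456789012:role/athena-parquet",
#     "athena-parquet",
#     "s3://athena-parquet-abc/test-data/"))
```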

Querying the Parquet file from AWS Athena

Now that the data and the metadata are created, we can use AWS Athena to query the parquet file. Choose the Athena service in the AWS Console.

  1. Choose Explore the Query Editor, which takes you to the query editor UI.
  2. Before you can proceed, Athena will require you to set up a Query Results Location. Select the prompt and set the Query Result Location to ‘s3://athena-parquet-<your-initials>/test-results/’.
  3. Go back to the Editor and run the following statement: ‘SELECT * FROM test_data LIMIT 10;’ The table name will be based on the folder name you chose in the S3 storage step.


Some of these steps, like using Glue Crawlers, aren’t required but are a better approach for handling Parquet files where the schema definition is unknown. Athena itself is a pretty handy service for getting hands-on with the files themselves, but it does come with some limitations. 

Those limitations include concurrency limits, price performance impact, and no control of your deployment. Many companies are moving to a managed service approach, which takes care of those issues. Learn more about AWS Athena limitations and why you should consider a managed service like Ahana for your SQL on S3 needs.


The Role of Blueprints in Lake Formation on AWS

Why does this matter?

There are 2 major steps to create a Data Lakehouse on AWS: the first is to set up your S3-based Data Lake, and the second is to run analytical queries on your data lake. A popular SQL engine that you can use is Presto. This article is focused on the first step and how AWS Lake Formation Blueprints can make that easy and automated. Before you can run analytics to get insights, you need your data continuously pooling into your lake!

AWS Lake Formation helps with the time-consuming data wrangling involved with maintaining a Data Lake, making it simple and secure. Lake Formation includes a Workflows feature; a workflow encompasses a complex set of ETL jobs to load and update data. 

work flow diagram

What is a Blueprint?

A Lake Formation Blueprint allows you to easily stamp out and create workflows. This is an automation capability within Lake Formation. There are 3 types: Database snapshots, incremental database, and log file blueprints.

The database blueprints support automated data ingestion from sources like MySQL, PostgreSQL, and SQL Server into the Open Data Lake. It’s a point-and-click service with simple forms in the AWS console.

A Database snapshot does what it sounds like: it loads all the tables from a JDBC source into your lake. This is good when you want time-stamped end-of-period snapshots to compare later.

An Incremental database blueprint also does what it sounds like, taking only the new data, or deltas, into the data lake. This is faster and keeps the latest data in your data lake. The Incremental database blueprint uses bookmarks on columns to track progress for each successive run. 

The Log file blueprint takes logs from various sources and loads them into the data lake. ELB logs, ALB logs, and CloudTrail logs are examples of popular log files that can be loaded in bulk. 

Summary and how about Ahana Cloud?

Getting data into your data lake is easy, automated, and consistent with AWS Lake Formation. Once you have your data ingested, you can use a managed service like Ahana Cloud for Presto to enable fast queries on your data lake to derive important insights for your users. Ahana Cloud has integrations with AWS Lake Formation governance and security policies. See that page here: https://ahana.io/aws-lake-formation 

lake formation diagram

What Are The Differences Between AWS Redshift Spectrum vs AWS Athena?

While the thrust of this article is an AWS Redshift Spectrum vs Athena comparison, there can be some confusion with the difference between AWS Redshift Spectrum and AWS Redshift. Very briefly, Redshift is the storage layer/data warehouse, and Redshift Spectrum is an extension to Redshift that is a query engine.

Amazon Athena

Athena is Amazon’s standalone, serverless SQL query engine implementation of Presto that is used to query data stored on Amazon S3. It is fully managed by Amazon; there is nothing to set up, manage, or configure. This also means that performance can be very inconsistent, as you have no dedicated compute resources.

Amazon Redshift Spectrum

Redshift Spectrum is an extension of Amazon Redshift. It is a serverless query engine that can query both AWS S3 data and tabular data in Redshift using SQL. This enables you to join data stored in external object stores with data stored in Redshift to perform more advanced queries.

Key Features & Differences: Redshift Spectrum vs Athena

Athena and Redshift Spectrum offer similar functionality, namely, serverless query of S3 data using SQL. That makes them easy to manage and cost-effective as there is nothing to set up and you are only charged based on the amount of data scanned. S3 storage is significantly less expensive than a database on AWS for the same amount of data.

  • Both are serverless, however Spectrum resources are allocated based on your Redshift cluster size, while Athena relies on non-dedicated, pooled resources.
  • Spectrum actually does need a bit of cluster management, but Athena is truly serverless.
  • Performance for Athena depends on your S3 optimization, while Spectrum, as previously noted, depends on your Redshift cluster resources and S3 optimization. If you need a specific query to run more quickly, then you can allocate additional compute resources to it.
  • Redshift Spectrum runs in tandem with Amazon Redshift, while Athena is a standalone query engine for querying data stored in Amazon S3.
  • Spectrum provides more consistency in query performance while Athena has inconsistent results due to the pooled resources.
  • Athena is great for simpler interactive queries, while Spectrum is more oriented towards large, complex queries.
  • The cost for both is the same at $5 per compressed terabyte scanned, however with Spectrum, you must also consider the Redshift compute costs.
  • Both use AWS Glue for schema management, and while Athena is designed to work directly with Glue, Spectrum needs external tables to be configured for each Glue catalog schema.
  • Both support federated queries.


The functionality of each is very similar, namely using standard SQL to query the S3 object store. If you are working with Redshift, then Spectrum can join information in S3 with tables stored in Redshift directly. Athena also has a Redshift connector to allow for similar joins; however, if you are already using Redshift, it would likely make more sense to use Spectrum.
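For example, exposing a Glue catalog database to Spectrum takes an external schema definition like the following sketch, where the schema, database, table names, and IAM role ARN are all placeholders:

```sql
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Glue tables in my_glue_db can now be joined with local Redshift tables:
SELECT *
FROM spectrum_demo.events e
JOIN local_schema.users u ON e.user_id = u.id;
```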


Keep in mind that when working with S3 objects, these are not traditional databases, which means there are no indexes to scan or use for joins. If you are working with high-cardinality files and trying to join them, you will likely see very poor performance.

When connecting to data sources other than S3, Athena has a connector ecosystem to work with, which provides a collection of sources that you can directly query with no copy required. Federated queries were added to Spectrum in 2020 and provide a similar capability with the added benefit of being able to perform transformations on the data and load it directly into Redshift tables.

AWS Athena vs Redshift: To Summarize

If you are already using Redshift, then Spectrum makes a lot of sense, but if you are just getting started with the cloud, then the Redshift ecosystem is likely overkill. AWS Athena is a good place to start if you are just getting started on the cloud and want to test the waters at low cost and minimal effort. Athena however quickly runs into challenges with regards to limits, concurrency, transparency and consistent performance. You can find more details here. Costs will increase significantly as the scanned data volume grows. 

At Ahana, many of our customers are previous Athena and/or Redshift users that saw challenges around price performance (Redshift) and concurrency/deployment control (Athena). Keep in mind that Athena and Redshift Spectrum have the same $5 per terabyte scanned cost, while Ahana is priced purely on instance hours. Ahana provides the power of Presto, ease of setup and management, price-performance, and dedicated compute resources. 
You can learn more about how Ahana compares to Amazon Athena here: https://ahana.io/amazon-athena/

AWS Lake Formation vs AWS Glue – What are the differences?

As you start building your analytics stack in AWS, there are several AWS technologies to understand as you begin. In this article we’ll discuss two key technologies: AWS Lake Formation for security and governance and AWS Glue, a data catalog. For reference, AWS Lake Formation is built on AWS Glue, and both services share the same AWS Glue data catalog.

AWS Lake Formation 

AWS Lake Formation makes it easier for you to build, secure, and manage data lakes.

AWS Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and ML services.

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

Lake Formation creates Glue workflows that ingest source tables, extract the data, and load it into the Amazon S3 data lake.

When to use AWS Lake Formation? 

  • Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  • Add Authorization on your Data Lake  – You can centrally define and enforce security, governance, and auditing policies.
  • Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and join data for analytics, machine learning, and application development. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog which discovers and catalogs metadata about your data stores or data lake.  Using the AWS Glue Data Catalog, users can easily find and access data.

When to use AWS Glue?

  • Create a unified data catalog to find data across multiple data stores – View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
  • Data Catalog for data lake analytics with S3 – Organize, cleanse, validate, and format data for storage in a data warehouse or data lake
  • Build ETL pipelines to ingest data into your S3 data lake. 

The data workflows initiated from an AWS Lake Formation blueprint are simply AWS Glue workflows. You can view and manage these workflows in either the Lake Formation console or the AWS Glue console.

AWS Lake Formation vs AWS Glue: A Summary

AWS Lake Formation simplifies security and governance on the Data Lake, whereas AWS Glue simplifies metadata management and data discovery for Data Lake analytics.

Check out our community roundtable where we discuss how you can build simple data lake with the new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack


Amazon S3 Select Limitations

What is Amazon S3 Select?

Amazon S3 Select allows you to use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. 

Why use Amazon S3 Select?

Instead of pulling the entire dataset and then manually extracting the data that you need,  you can use S3 Select to filter this data at the source (i.e. S3). This reduces the amount of data that Amazon S3 transfers, which reduces the cost, latency, and data processing time at the client.

What formats are supported for S3 Select?

Currently Amazon S3 Select only works on objects stored in CSV, JSON, or Apache Parquet format. The stored objects can be compressed with GZIP or BZIP2 (for CSV and JSON objects only). The returned filtered results can be in CSV or JSON, and you can determine how the records in the result are delimited.

How can I use Amazon S3 Select standalone?

You can perform S3 Select SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. 
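For instance, here is a minimal boto3 sketch of an S3 Select call against a CSV object; the bucket, key, and column names are placeholders, and the call itself is not invoked here:

```python
def select_request(bucket, key, expression):
    """Build the arguments for S3's SelectObjectContent API against a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
        "OutputSerialization": {"CSV": {}},
    }

def run_select(bucket, key, expression):
    """Stream the filtered rows back from S3.

    Requires boto3 and configured AWS credentials.
    """
    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")
    response = s3.select_object_content(**select_request(bucket, key, expression))
    for event in response["Payload"]:  # the response is an event stream
        if "Records" in event:
            yield event["Records"]["Payload"].decode()

# Example (placeholder object and columns):
# for chunk in run_select("my-bucket", "data/records.csv",
#                         "SELECT s.att1 FROM S3Object s WHERE s.att2 = 'Z'"):
#     print(chunk)
```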

What are the limitations of S3 Select?

Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

Additionally, the following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. 

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy.
  • Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 256 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

What is the difference between S3 Select and Presto?

S3 Select is a minimalistic version of pushdown to source with a limited support for the ANSI SQL Dialect. Presto on the other hand is a comprehensive ANSI SQL compliant query engine that can work with various data sources. Here is a quick comparison table.

Comparison               S3 Select                      Presto
SQL Dialect              Fairly limited                 Comprehensive
Data Format Support      CSV, JSON, Parquet             Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet
Data Sources             S3 only                        Various (over 26 open-source connectors)
Push-Down Capabilities   Limited to supported formats   Varies by format and underlying connector

What is the difference between S3 Select and Athena?

Athena is Amazon’s fully managed service for Presto. As such, the comparison between Athena and S3 Select is the same as outlined above. For a more detailed understanding of the difference between Athena and Presto, see here.

How does S3 Select work with Presto?

S3SelectPushdown can be enabled on your Hive catalog as a configuration to push projection (SELECT) and predicate (WHERE) processing down to S3 Select. With S3SelectPushdown, Presto retrieves only the required data from S3 instead of entire S3 objects, reducing both latency and network usage.

Should I turn on S3 Select for my workload on Presto? 

S3SelectPushdown is disabled by default and you should enable it in production after proper benchmarking and cost analysis. The performance of S3SelectPushdown depends on the amount of data filtered by the query. Filtering a large number of rows should result in better performance. If the query doesn’t filter any data then pushdown may not add any additional value and the user will be charged for S3 Select requests.

We recommend that you benchmark your workloads with and without S3 Select to see if using it may be suitable for your workload. For more information on S3 Select request cost, please see Amazon S3 Cloud Storage Pricing.

Use the following guidelines to determine if S3 Select is a good fit for your workload:

  • Your query filters out more than half of the original data set.
  • Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The TIMESTAMP, REAL, and DOUBLE data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see the Data Types documentation.
  • Your network connection between Amazon S3 and the Presto cluster has good transfer speed and available bandwidth (For the best performance on AWS, your cluster is ideally colocated in the same region and the VPC is configured to use the S3 Gateway endpoint).
  • Amazon S3 Select does not compress HTTP responses, so the response size may increase for compressed input files.

Additional Considerations and Limitations:

  • Only objects stored in CSV format are supported (Parquet is not supported in Presto via the S3 Select configuration). Objects can be uncompressed or optionally compressed with gzip or bzip2.
  • The “AllowQuotedRecordDelimiters” property is not supported. If this property is specified, the query fails.
  • Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption is not supported.
  • S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC and Parquet.

S3 Select makes sense for my workload on Presto, how do I turn it on?

You can enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property or the hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, allowing you to enable or disable it on a per-query basis. You may also need to tune connection properties such as hive.s3select-pushdown.max-connections, depending on your workload.
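For example, to enable it cluster-wide in the Hive catalog properties file (the connection limit shown is illustrative; tune it for your workload):

```properties
hive.s3select-pushdown.enabled=true
hive.s3select-pushdown.max-connections=500
```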


Querying Amazon S3 Data Using AWS Athena

The data lake is becoming increasingly popular for more than just data storage. Now we see much more flexibility with what you can do with the data lake itself – add a query engine on top to get ad hoc analytics, reporting and dashboarding, machine learning, etc. 

How Athena works with Amazon S3

In AWS land, Amazon S3 is the de facto data lake. Many AWS users who want to start easily querying that data will use Amazon Athena, a serverless query service that allows you to run ad hoc analytics using SQL on your data. Amazon Athena is built on Presto, the open source SQL query engine that came out of Meta (Facebook) and is now an open source project housed under the Linux Foundation. One of the most popular use cases is to query S3 with Athena.

The good news about Amazon Athena is that it’s really easy to get up and running. You can simply add the service and start running queries on your S3 data lake right away. Because Athena is based on Presto, you can query data in many different formats including JSON, Apache Parquet, Apache ORC, CSV, and a few more. Many companies today use Athena to query S3.

How to query S3 using Athena

The first thing you’ll need to do is create a new bucket in AWS S3 (or you can use an existing one, though for the purposes of testing it out, creating a new bucket is probably helpful). You’ll use Athena to query S3 buckets. Next, open up your AWS Management Console and go to the Athena home page. From there you have a few options for how to create a table; for this example, just select the “Create table from S3 bucket data” option. 

From there, AWS has made it fairly easy to get up and running in a quick 4 step process where you’ll define the database, table name, and S3 folder where data for this table will come from. You’ll select the data format, define your columns, and then set up your partitions (this is if you have a lot of data). Briefly laid out:

  1. Set up your Database, Table, and Folder Names & Locations
  2. Choose the data format you’ll be querying
  3. Define your columns so Athena understands your data schema
  4. Set up your Data Partitions if needed
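Steps 2–4 above correspond to DDL like the following sketch, where the table name, columns, and bucket path are all illustrative:

```sql
CREATE EXTERNAL TABLE my_db.iot_events (
  device_id STRING,
  att1      STRING,
  att2      STRING,
  reading   DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/iot-events/';
```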

Now you’re ready to start querying with Athena. You can run simple select statements on your data, giving you the ability to run SQL on your data lake.

When Athena hits its limits

While Athena is very easy to get up and running, it has known limitations that start impacting price performance as usage grows. Those include query limits, partition limits, a lack of deterministic performance, and some others. It’s actually why we see a lot of previous Athena users move to Ahana Cloud for Presto, our managed service for Presto on AWS. 

Here’s a quick comparison between the two offerings:

Looking at Amazon Athena and Ahana Cloud

Some of our customers shared why they moved from AWS Athena to Ahana Cloud. Adroitts saw 5.5X price performance improvement, faster queries, and more control after they made the switch, while SIEM leader Securonix saw 3X price performance improvement along with better performing queries.

We can help you benchmark Athena against Ahana Cloud, get in touch with us today and let’s set up a call.


What is AWS Lake Formation?

For AWS users who want to get governance on their data lake, AWS Lake Formation is a service that makes it easy to set up a secure data lake very quickly (in a matter of days), providing a governance layer for Amazon S3. 

We’re seeing more companies move to the data lake because it’s flexible, cheaper, and much easier to use than a data warehouse. You’re not locked into proprietary formats, nor do you have to ingest all of your data into a proprietary technology. As more companies are leveraging the data lake, then security becomes even more important because you have more people needing access to that data and you want to be able to control who sees what. 

AWS Lake Formation can help address security on the data lake. For Amazon S3 users, it’s a seamless integration that allows you to get granular security policies in place on your data. AWS Lake Formation gives you three key capabilities:

  1. Build data lakes quickly – this means days not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
  2. Simplify security management – You can centrally define and enforce security, governance, and auditing policies.
  3. Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.

If you’re currently using AWS S3 or planning to, we recommend looking at AWS Lake Formation as an easy way to get security policies in place on your data lake. As part of your stack, you’ll also need a query engine that will allow you to get analytics on your data lake. The most popular engine to do that is Presto, an open source SQL query engine built for the data lake.

At Ahana, we’ve made it easy to get started with this stack: AWS S3 + Presto + AWS Lake Formation. We provide SaaS for Presto with out of the box integrations with S3 and Lake Formation, so you can get a full data lake analytics stack up and running in a matter of hours.

AWS lake formation diagram

Check out our webinar where we share more about our integration with AWS Lake Formation and how you can actually enforce security policies across your organization.


How does Presto Work With LDAP?

What is LDAP?

The Lightweight Directory Access Protocol (LDAP) is an open, vendor-neutral, industry-standard application protocol used for directory services authentication. With LDAP user authentication, the Presto server validates the credentials of connecting users against an external LDAP server. 

Presto & LDAP

Presto can be configured to enable LDAP authentication over HTTPS for clients, such as the Presto CLI, or the JDBC and ODBC drivers. At present only a simple LDAP authentication mechanism involving username and password is supported. The Presto client sends a username and password to the coordinator and the coordinator validates these credentials using an external LDAP service.

To enable LDAP authentication for Presto, the Presto coordinator configuration file needs to be updated with LDAP-related configurations. No changes are required to the worker configuration; only the communication from the clients to the coordinator is authenticated. However, if you want to secure the communication between Presto nodes then you should configure Secure Internal Communication with SSL/TLS.

Summary of Steps to Configure LDAP Authentication with Presto:

Step 1: Gather configuration details about your LDAP server

Presto requires Secure LDAP (LDAPS), so make sure you have TLS enabled on your LDAP server as well.

Step 2: Configure SSL/TSL on Presto Coordinator

Access to the Presto coordinator must be through HTTPS when using LDAP authentication.

Step 3: Configure Presto Coordinator with config.properties for LDAP

Step 4: Create a Password Authenticator Configuration (etc/password-authenticator.properties) file on the coordinator

Step 5: Configure Client / Presto CLI with either a Java Keystore file or Java Truststore for its TLS configuration.

Step 6: Restart your Presto cluster and invoke the LDAP-enabled CLI with the --keystore-* properties, the --truststore-* properties, or both, to secure the TLS connection.
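A minimal sketch of the coordinator files from Steps 3 and 4; the paths, port, keystore password, LDAP URL, and bind pattern are placeholders for your environment:

```properties
# config.properties (Step 3)
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto/keystore.jks
http-server.https.keystore.key=changeit

# etc/password-authenticator.properties (Step 4)
password-authenticator.name=ldap
ldap.url=ldaps://ldap-server.example.com:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com
```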

Reference: https://prestodb.io/docs/current/security/ldap.html

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. Check out our presentation with AWS on how to get started in 30min with Presto in the cloud.


What is Apache Ranger?

Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the data platform. It is an open-source authorization solution that provides access control and audit capabilities for big data platforms through centralized security administration.

Its open data governance model and plugin architecture have enabled the extension of access control to projects beyond the Hadoop ecosystem, and the platform is widely adopted by major cloud vendors such as AWS, Azure, and GCP. 

With the help of the Apache Ranger console, admins can easily manage centralized, fine-grained access control policies, including file, folder, database, table and column-level policies across all clusters. These policies can be defined at user level, role level or group level.

Apache Service Integration

Apache Ranger uses a plugin architecture to allow other services to integrate seamlessly with its authorization controls.

Apache Ranger plugin diagram

Figure: Simple sequence diagram showing how the Apache Ranger plugin enforces authorization policies with Presto Server.

Apache Ranger also supports centralized auditing of user access and administrative actions for comprehensive visibility into sensitive data usage. A centralized audit store tracks all access requests in real time and supports multiple backends, including Elasticsearch and Solr.

Many companies are today looking to leverage the Open Data Lake Analytics stack, which is the open and flexible alternative to the data warehouse. In this stack, you have flexibility when it comes to your storage, compute, and security to get SQL on your data lake. With Ahana Cloud, the stack includes AWS S3, Presto, and in this case our Apache Ranger integration. 

Ahana Cloud for Presto and Apache Ranger

Ahana-managed Presto clusters can take advantage of the Apache Ranger integration to enforce access control policies defined in Apache Ranger. Ahana Cloud for Presto enables you to get up and running with the Open Data Lake Analytics stack in 30 minutes. It’s SaaS for Presto and takes away all the complexities of tuning, management, and more. Check out our on-demand webinar, hosted with DZone, where we share how you can build an Open Data Lake Analytics stack.

What is a Data Lakehouse Architecture?

The term Data Lakehouse has become very popular over the last year or so, especially as more customers are migrating their workloads to the cloud. This article will help to explain what a Data Lakehouse is, the common architecture of a Data Lakehouse, and how companies are using the Data Lakehouse in production today. Finally, we’ll share a bit on where Ahana Cloud for Presto fits into this architecture and how real companies are leveraging Ahana as the query engine for their Data Lakehouse.

What is a Data Lakehouse?

First, it’s best to explain a Data Warehouse and a Data Lake.

Data Warehouse

A data warehouse is one central place where you can store specific, structured data. Most of the time that’s relational data that comes from transactional systems, business apps, and operational databases. You can run fast analytics on the Data Warehouse with very good price/performance. Using a data warehouse typically means you’re locked into that Data Warehouse’s proprietary formats – the trade-off for the speed and price/performance is that your data is ingested and locked into that warehouse, so you lose the flexibility of a more open solution.

Data Lake

On the other hand, a Data Lake is one central place where you can store any kind of data you want – structured, unstructured, etc. – at scale. Popular Data Lakes are AWS S3, Microsoft Azure, and Google Cloud Storage. Data Lakes are widely popular because they are very cheap and easy to use – you can literally store an unlimited amount of any kind of data you want at a very low cost. However, the data lake doesn’t provide built-in mechanisms like query, analytics, etc. You need a query engine and data catalog on top of the data lake to query your data and make use of it (that’s where Ahana Cloud comes in, but more on that later).

Data Lakehouse explained_diagram

Data Lakehouse

Now let’s look at the Data Lake vs the Lakehouse. A new data lakehouse architecture has emerged that takes the best of the Data Warehouse and the Data Lake. It’s open, flexible, has good price/performance, and can scale like the Data Lake, while also supporting transactions and strong security like the Data Warehouse.

Data Lakehouse Architecture Explained

Here’s an example of a Data Lakehouse architecture:

An example of a Data Lakehouse architecture

You’ll see the key components include your Cloud Data Lake, your catalog & governance layer, and the data processing (SQL query engine). On top of that you can run your BI, ML, Reporting, and Data Science tools. 

There are a few key characteristics of the Data Lakehouse. First, it’s based on open data formats – think ORC, Parquet, etc. That means you’re not locked into a proprietary format and can use an open source query engine to analyze your data. Your lakehouse data can be easily queried with SQL engines.

Second, a governance/security layer on top of the data lake is important to provide fine-grained access control to data. Last, performance is critical in the Data Lakehouse. To compete with data warehouse workloads, the data lakehouse needs a high-performing SQL query engine on top. That’s where open source Presto comes in, which can provide that extreme performance to give you similar, if not better, price/performance for your queries.

Building your Data Lakehouse with Ahana Cloud for Presto

At the heart of the Data Lakehouse is your high-performance SQL query engine. That’s what enables you to get high performance analytics on your data lake data. Ahana Cloud for Presto is SaaS for Presto on AWS, a really easy way to get up and running with Presto in the cloud (it takes under an hour). This is what your Data Lakehouse architecture would look like if you were using Ahana Cloud:

Building your Data Lakehouse with Ahana Cloud for Presto_diagram

Ahana comes built-in with a data catalog and caching for your S3-based data lake. With Ahana you get the capabilities of Presto without having to manage the overhead – Ahana takes care of it for you under the hood. The stack also integrates with transaction managers like Apache Hudi, Delta Lake, and AWS Lake Formation.

We shared more on how to unlock your data lake with Ahana Cloud in the data lakehouse stack in a free on-demand webinar.

Ready to start building your Data Lakehouse? Try it out with Ahana. We have a 14-day free trial (no credit card required), and in under 1 hour you’ll have SQL running on your S3 data lake.

How to use mathematical functions and operators and aggregate functions for Presto?

Presto offers several classes of mathematical functions that operate on single values and mathematical operators that allow for operations on values across columns. In addition, aggregate functions can operate on a set of values to compute a single result.

The mathematical functions are broken into four subcategories: 1. mathematical, 2. statistical, 3. trigonometric, and 4. floating point. The majority fall into the mathematical category, and we’ll discuss them separately. The statistical functions are quite sparse, with just two functions that compute the lower and upper bounds of the Wilson score interval of a Bernoulli process. The trigonometric functions are what you’d expect (e.g. sin, cos, tan, etc.). The floating point functions are really functions that handle not-a-number and infinity use cases.
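To make the smaller subcategories concrete, here is an illustrative query; the function names are taken from the Presto documentation, but verify the exact signatures against your Presto version:

```sql
SELECT
  wilson_interval_lower(80, 100, 1.96) AS wilson_lower,  -- statistical: lower bound for 80 successes in 100 trials
  wilson_interval_upper(80, 100, 1.96) AS wilson_upper,  -- statistical: upper bound
  sin(pi() / 2)                        AS sin_half_pi,   -- trigonometric: returns 1.0
  is_nan(nan())                        AS nan_check;     -- floating point: returns true
```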

Functions in the mathematical subcategory fall into a further layer of classification:

  1. Functions that perform coarser approximation, such as rounding and truncation: abs, ceiling (ceil), floor, round, sign, truncate
  2. Conversions: degrees, radians, from_base, to_base
  3. Exponents, logarithms, roots: exp, ln, log2, log10, power (pow), cbrt, sqrt
  4. Convenient constants, such as pi(), e(), random (rand)
  5. Cumulative distribution functions (and inverses): binomial_cdf, inverse_binomial_cdf, cauchy_cdf, inverse_cauchy_cdf, chi_squared_cdf, inverse_chi_squared_cdf, normal_cdf, inverse_normal_cdf, poisson_cdf, inverse_poisson_cdf, weibull_cdf, inverse_weibull_cdf, beta_cdf, inverse_beta_cdf, width_bucket
  6. Miscellaneous: mod, cosine_similarity

The mathematical operators are the basic arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), and modulus (%).
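For example, the following standalone query illustrates the operators; note that dividing two integers performs integer division:

```sql
SELECT
  7 + 2 AS add,       -- 9
  7 - 2 AS subtract,  -- 5
  7 * 2 AS multiply,  -- 14
  7 / 2 AS divide,    -- 3 (integer division truncates)
  7 % 2 AS modulus;   -- 1
```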

Let’s apply these mathematical functions in an example. In the following query, we have a floating-point column x to which we apply several mathematical functions that are representative of the subcategories we discussed previously: radians (conversion), natural log, the Normal CDF, modulo, random number, and operators.

select
	radians(x) as radians_x,              /* convert to radians */
	ln(x) as ln_x,                        /* natural log */
	normal_cdf(0, 30, x) as normal_cdf_x, /* Normal CDF */
	mod(x, 2) as mod_x_2,                 /* modulo 2 */
	random() as r,                        /* random number */
	3 * ((x / 2) + 2) as formula          /* formula using operators */
from example

The following is the output of the above query, with some rounding for ease of viewing.

So far we have seen that mathematical functions, as they are classified in Presto, operate on single values. This means that given a column of values, each function is applied element-wise to that column. Aggregate functions allow us to look across a set of values.

Like mathematical functions, aggregate functions are also broken into subcategories: 1. general, 2. bitwise, 3. map, 4. approximate, 5. statistical, 6. classification metrics, and 7. differential entropy. We will discuss the general and approximate subcategories separately.

The bitwise aggregate functions are two functions that return the bitwise AND and bitwise OR of all input values in 2’s complement representation. The map aggregate functions provide convenient map creation functions from input values. The statistical aggregate functions are standard summary statistic functions you would expect, such as stddev, variance, kurtosis, and skewness. The classification metrics and differential entropy aggregate functions are specialized functions that make it easy to analyze binary classification predictive modeling and model binary differential entropy, respectively.

Functions in the general subcategory fall into a further layer of classification:

  1. Common summarizations: count, count_if, min, max, min_by, max_by, sum, avg, geometric_mean, checksum
  2. Boolean tests: bool_or, bool_and, every
  3. Data structure consolidation: array_agg, set_agg, set_union
  4. Miscellaneous: reduce_agg, arbitrary

Again, let’s apply these aggregate functions in a series of representative examples. In the following query, we apply a series of basic aggregations to our floating-point column x.

select
	sum(x) as sum_x,
	count(x) as count_x,
	min(x) as min_x,
	max(x) as max_x,
	avg(x) as avg_x,
	checksum(x) as ckh_x
from example

The following is the output of the above query.

In the following query, we showcase a boolean test with the bool_or function. We know that the natural log returns NaN for negative values of x. So, if we apply the is_nan check, we expect is_nan(x) to always be false, but our ln result to occasionally be true. Finally, if we do the bool_or aggregation on our is_nan results, we expect the column derived from x to be false (i.e. no true values at all) and the column derived from ln(x) to be true (i.e. at least one true value). The following query and accompanying result illustrate this.

with nan_test as (
	select
		is_nan(x) as is_nan_x,
		is_nan(ln(x)) as is_nan_ln_x
	from example
)
select
	bool_or(is_nan_x) as any_nan_x_true,
	bool_or(is_nan_ln_x) as any_nan_ln_x_true
from nan_test;

This final example illustrates data structure consolidation, taking the x and radians(x) columns and creating a single row with a map data structure.

with rad as (select x, radians(x) as rad_x from example)
select map_agg(x, rad_x) from rad;

The approximate aggregate functions provide approximate results for aggregations over large data sets, such as distinct values (approx_distinct), percentiles (approx_percentile), and histograms (numeric_histogram). In fact, we have a short answer post on how to use the approx_percentile function. Several of the approximate aggregate functions rely on other functions and data structures: quantile digest, HyperLogLog, and KHyperLogLog.

A natural extension to aggregate functions are window functions, which perform calculations across the rows of a query result. In fact, all aggregate functions can be used as window functions by adding an OVER clause. One popular application of window functions is time-series analysis; in particular, the lag window function is quite useful. We have short answer posts on how to use the lag window function and how to compute differences between dates using the lag window function.

This short article was a high-level overview, and you are encouraged to review the Presto public documentation for Mathematical Functions and Operators and Aggregate Functions. If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

What is a Presto lag example?

The Presto lag function is a window function that returns the value at a specified offset before the current row in a window. One common use case for the lag function is time series analysis, such as autocorrelation.

Figure 1 shows the advert table of sales and advertising expenditure from Makridakis, Wheelwright and Hyndman (1998) Forecasting: methods and applications, John Wiley & Sons: New York. The advert column is the monthly advertising expenditure, and the sales column is the monthly sales volume.


A simple analysis could be to track the difference between the current month’s sales volume and the previous one, which is shown in Figure 2. The lag_1_sales column is a single period lagged value of the sales column, and the diff column is the difference between sales and lag_1_sales. To generate the table in Figure 2, we can use the lag function and the following query:

select
  round(sales - lag_1_sales, 2) as diff
from (
  select
    sales,
    lag(sales, 1) over (range unbounded preceding) as lag_1_sales
  from advert
);
The subquery uses the lag function to get a one period offset preceding value of the sales column, where the OVER clause syntax is specifying the window. The main query then computes the diff column. Here are a couple of additional useful notes about the lag function:

  1. You can change the offset with the second argument: lag(x, OFFSET), where OFFSET is any scalar expression. The default OFFSET is 1 (the previous row); an OFFSET of 0 refers to the current row.
  2. By default, if an offset value is null or outside the specified window, a NULL value is used. We can see this in the first row of the table in Figure 2. However, the default value to use in these cases is configurable with an optional third argument: lag(x, OFFSET, DEFAULT_VALUE), where DEFAULT_VALUE is the desired value.
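As a sketch of the third argument, the subquery above could replace the first-row NULL with 0.0 (the 0.0 default is an arbitrary choice for illustration):

```sql
select
  sales,
  lag(sales, 1, 0.0) over (range unbounded preceding) as lag_1_sales
from advert;
```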

A closely related function is the lead function, which returns the value at an offset after the current row.

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

How do I get the date_diff from previous rows?

To find the difference in time between consecutive dates in a result set, Presto offers window functions. Take the example table below which contains sample data of users who watched movies.


select * from movies.ratings_csv limit 10;

select userid, date_diff('day', timestamp, lag(timestamp) over (partition by userid order by timestamp desc)) as timediff from ratings_csv order by userid desc limit 10;

The lag(x, offset) function fetches the value of column x at offset rows from the current row, which the query uses to calculate the difference between consecutive timestamps. When no offset is provided, the default is 1 (the previous row). Notice that the first row in timediff is NULL because there is no previous row.

How do I use the approx_percentile function in Presto?

The Presto approx_percentile is one of the approximate aggregate functions, and it returns an approximate percentile for a set of values (e.g. column). In this short article, we will explain how to use the approx_percentile function.

What is a percentile?

From Wikipedia:

In statistics, a percentile (or a centile) is a score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition)

To apply this, we’ll walk through an example with data points from a known, and arguably most famous, distribution: the Normal (or Gaussian) distribution. The adjacent diagram plots the density of a Normal distribution with a mean of 100 and standard deviation of 10. If we were to sample data points from this Normal distribution, we know that approximately half of the data points would be less than the mean and half would be above the mean. Hence, the mean, or 100 in this case, would be the 50th percentile of the data. It turns out that the 90th percentile would be approximately 112.82; this means that 90% of the data points are less than 112.82.

approx_percentile by example

To solidify our understanding of percentiles and the approx_percentile function, we’ve created a few tables to use as example:

presto:default> show tables;
    Table
-------------
 dummy
 norm_0_1
 norm_100_10
 norm_all
(4 rows)
Table       | Description                                                                  | Number of Rows
dummy       | Single column, 100-row table of all ones except for a single value of 100.  | 100
norm_0_1    | Samples from a normal distribution with mean 0 and standard deviation 1.    | 5,000
norm_100_10 | Samples from a normal distribution with mean 100 and standard deviation 10. | 5,000
norm_all    | Coalescence of all normal distribution tables.                               | 10,000
Table 1

The approx_percentile function has eight type signatures. You are encouraged to review the Presto public documentation for all the function variants and official descriptions. The set of values (e.g. column) is a required parameter and is always the first argument.

Another required parameter is the percentage parameter, which indicates the percentage or percentages for the returned approximate percentile. The percentage(s) must be specified as a number between zero and one. The percentage parameter can either be the second or third argument of the function, depending on the intended signature. In the following examples, the percentage parameter will be the second argument. For example, approx_percentile(x,0.5) will return the approximate percentile for column x at 50%. For data points in our norm_100_10 table, we expect the returned value to be around 100.

presto:default> select approx_percentile(x,0.5) from norm_100_10;
(1 row)

approx_percentile(x,0.9) will return the approximate percentile for column x at 90%, which for the data in norm_100_10 table should be around 112.82.

presto:default> select approx_percentile(x,0.9) from norm_100_10;
(1 row)

In a single query, you can also specify an array of percentages to compute percentiles: approx_percentile(x,ARRAY[0.5, 0.9]).

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_100_10;
 [99.8184647799587, 112.692881777202]
(1 row)

We can ask for multiple percentages for our dummy table, which consists of 100 rows of all ones except for a single value of 100. Hence, we expect all percentiles below 99% to be 1.

presto:default> select approx_percentile(x,ARRAY[0.1, 0.5, 0.98, 0.99]) from dummy;
 [1.0, 1.0, 1.0, 100.0]
(1 row)

We can also use a GROUP BY clause to segment the values to compute percentiles over. To illustrate this, we will use our norm_all table, which contains values from both the norm_100_10 and the norm_0_1 tables. The m and sd columns specify the mean and standard deviation of the normal distribution the corresponding x value is sampled from.

presto:default> select m, sd, x from norm_all order by rand() limit 10;
  m  | sd |         x
 0   | 1  | -0.540796486700647
 0   | 1  |   0.81148151337731
 0   | 1  |   1.28976310661005
 100 | 10 |   97.0272872801269
 100 | 10 |   83.1392343835652
 0   | 1  | -0.585678877703149
 0   | 1  |  0.268589447255106
 0   | 1  | -0.280908719376113
 100 | 10 |    104.36328077332
 0   | 1  |  0.266294347905949
(10 rows)

The following query will then return the approximate percentiles for 50% and 90% for data points grouped by the same values of m and sd (i.e. from the same normal distribution): select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd)). As expected, we see that the approximate 50% and 90% percentiles are around 100 and 112.82 for a mean of 100 and standard deviation of 10, and around 0 and 1.28 for a mean of 0 and standard deviation of 1.

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd));
 [99.8563616481321, 112.879972343696]
 [-0.00458419083839064, 1.30949677294588]
(2 rows)

An optional parameter is accuracy, which controls the maximum rank error and defaults to 0.01. The value of accuracy must be between zero and one (exclusive) and must be constant for all input rows. We can add accuracy as the third argument to our function. For example, approx_percentile(x,0.9,0.5) will return the approximate percentile for column x at 90% with 0.5 maximum rank error accuracy. By allowing for larger error (from the 0.01 default), we can see the approximate percentile of 113.50 is farther away from our true value of 112.82 than our previous result of 112.88.

presto:default> select approx_percentile(x,0.9,0.5) from norm_100_10;
(1 row)

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

Can I write back or update data in my Hadoop / Apache Hive cluster through Presto?

Using Presto with a Hadoop cluster for SQL analytics is pretty common, especially in on-premises deployments.

With Presto, you can read and query data from the Hadoop datanodes, and you can also make changes to data in Hadoop HDFS. There are, however, some restrictions.

All this is enabled via Presto’s Hive Connector. 

The first step is to create a catalog properties file and point to the Hive Metastore. 

You can also optionally configure some Hive Metastore properties for the Hive Connector. 

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:
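Per the Presto documentation, the file would contain the following (with the metastore address replaced as noted above):

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
```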



For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. 

Creating a new table in HDFS via Hive

Using Presto you can create new tables via the Hive Metastore. 


Create table: 

Create a new Hive table named page_views in the web schema, stored using the ORC file format and partitioned by date and country. hive is the name of the catalog, which matches the name of the properties file.

CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country']
)
Deleting data from Hive / Hadoop

With the Hive connector, you can delete data but this has to be at the granularity of entire partitions. 


Drop a partition from the page_views table:

DELETE FROM hive.web.page_views
WHERE ds = DATE '2016-08-09'
  AND country = 'US'

Drop the external table request_logs. This only drops the metadata for the table. The referenced data directory is not deleted:

DROP TABLE hive.web.request_logs

Drop a schema:

DROP SCHEMA hive.web

Hive Connector Limitations

  • DELETE is only supported if the WHERE clause matches entire partitions.
  • UPDATE is not supported from Presto

How do I convert Unix Epoch time to a date or something more human readable with SQL?

Many times the Unix Epoch Time gets stored in the database. But this is not very human readable and conversion is required for reports and dashboards. 

Example of Unix Epoch Time: 1529853245


Presto provides many date time functions to help with conversion. 

In the case of a Unix Epoch Time, the from_unixtime function can be used to convert the Epoch time.

This function returns a timestamp. 

from_unixtime(unixtime) → timestamp
Returns the UNIX timestamp unixtime as a timestamp.

After converting the Unix Epoch time to a timestamp, you can cast it into other formats as needed such as extracting just the date. Examples follow below. 



select from_unixtime(1529853245) as timestamp;

2018-06-24 15:14:05.000

select cast(from_unixtime(1529853245) as date) as date;

2018-06-24
More examples and information can be found here: https://ahana.io/answers/how-to-convert-date-string-to-date-format-in-presto/

How do I transfer data from a Hadoop / Hive cluster to a Presto cluster?

Hadoop is a system that manages both compute and data together. Hadoop cluster nodes have the HDFS file system and may also have different types of engines like Apache Hive, Impala or others running on the same or different nodes. 

In comparison, Presto, an open source SQL engine built for data lakes, is only a query engine. This means that it does not manage its own data. It can query data sitting in other places like HDFS or in cloud data lakes like AWS S3. 

Because of this, there is no data transfer or ingestion required into Presto for data that is already residing in an HDFS cluster. Presto’s Hive Connector was specifically designed to access data in HDFS and query it in Presto. The Hive connector needs to be configured with the right set of config properties.

The Presto Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:
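Per the Presto documentation, the file would contain the following (with the metastore address replaced as noted above):

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
```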



For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. In some cases, such as when using federated HDFS or NameNode high availability, it is necessary to specify additional HDFS client options in order to access your HDFS cluster. To do so, add the hive.config.resources property to reference your HDFS config files:
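For example, assuming your HDFS config files live at the standard paths below (adjust for your environment):

```
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
```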


Presto substring operations: How do I get the X characters from a string of a known length?

Presto provides an overloaded substring function to extract characters from a string. We will use the string “Presto String Operations” to demonstrate the use of this function.

Extract last 7 characters:

presto> SELECT substring('Presto String Operations',-7) as result;

 result
---------
 rations
(1 row)

Query 20210706_225327_00014_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract last 10 characters:

presto> SELECT substring('Presto String Operations',-10) as result;

   result
------------
 Operations
(1 row)

Query 20210706_225431_00015_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the middle portion of the string:

presto> SELECT substring('Presto String Operations',8,6) as result;

 result
--------
 String
(1 row)

Query 20210706_225649_00020_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Extract the beginning portion of the string:

presto> SELECT substring('Presto String Operations',1,6) as result;

 result
--------
 Presto
(1 row)

Query 20210706_225949_00021_rtu2h, FINISHED, 1 node

Splits: 17 total, 17 done (100.00%)

0:00 [0 rows, 0B] [0 rows/s, 0B/s]

What is Spark SQL?

Spark is a general purpose computation engine for large-scale data processing. At Spark’s inception, the primary abstraction was a resilient distributed dataset (RDD), an immutable distributed collection of data. Since then, higher level abstractions—called DataFrames and Datasets—that more closely resemble classic database tables have been introduced to work with structured data. Spark SQL is the Spark module for working with these abstractions and structured data.

In addition to DataFrames and Datasets, Spark SQL also exposes SQL to interact with data stores and DataFrames/Datasets. For example, let’s say we wanted to return all records from a table called people with the basic SQL query: SELECT * FROM people. To do so with Spark SQL, we could programmatically express this in Python as follows:

people_dataframe = spark.sql("SELECT * FROM people")

spark is a SparkSession instance, the main Spark entry point for structured data abstractions, and the statement returns a Spark DataFrame. With the SQL API, you can express SQL queries and get back DataFrames, and from DataFrames you can create tables on which you can execute SQL queries. Because the SQL language is widely known, it allows a broader range of data practitioner personas, such as SQL analysts, to perform data processing on top of Spark.

Since Spark 3.0, Spark SQL has offered experimental options to be strictly ANSI compliant instead of Hive compliant. Prior to that, Spark SQL supported both ANSI SQL and HiveQL. Please consult the official Apache Spark SQL Reference if you are interested in the specifics of supported syntax, semantics, and keywords.

Regardless of whether you express data processing directly with DataFrame/Dataset methods or SQL, Spark SQL runs the same execution engine under the hood.  Further, through Spark SQL, the structured nature of the data and processing provide additional context to Spark about the data itself—such as the column types—and the workload.  This additional context allows for additional optimization, often resulting in better performance.

While Spark SQL is a general-purpose engine, you might want to consider Presto if your target use cases are predominantly interactive, low-latency queries on structured data. We compare Spark SQL and Presto in this short article.

How do I query a data lake with Presto?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Structured and semi-structured data can be queried by Presto, an open source SQL engine. This allows users to store data as-is, without having to first structure the data, and run different types of analytics. 

To query this data in data lakes, the following technologies are needed. 

  1. A SQL query engine – Presto was built for querying data lakes like HDFS and, increasingly, cloud object stores like AWS S3 and Google Cloud Storage.
  2. A big data catalog – there are two popular big data catalog systems also called metastores – the Hive Metastore and AWS Glue service. 
  3. Buckets in the data lake like AWS S3 

What types of data can be queried by Presto? 

The following file formats can be queried by Presto 

  1. ORC
  2. Parquet
  3. Avro
  4. RCFile
  5. SequenceFile
  6. JSON
  7. Text
  8. CSV

How does it work? 

First, data in the data lake needs to be mapped into tables and columns. This is what the Hive Metastore and AWS Glue catalogs help with. For example, if there is a CSV file, once Presto, the Hive Metastore, and Glue are integrated, users can use Presto commands to create a schema and then a create table statement to map the file to a table and columns.


USE ahana_hive.default;

CREATE TABLE user (
	registration_dttm 	timestamp,
	id 					int,
	first_name 			varchar,
	last_name 			varchar,
	email 				varchar,
	gender 				varchar,
	ip_address 			varchar,
	cc 					varchar,
	country 			varchar,
	birthdate 			varchar,
	salary 				double,
	title 				varchar,
	comments 			varchar
) WITH (
	format = 'CSV',
	skip_header_line_count = 1,
	external_location = 's3a://ahana/userdata/'
);

Once the table is created and mapped to the external location, it can immediately be queried. 


Select * from user; 

You can run these commands in Presto using the Presto-cli. More information in the docs.

If you are looking for better performance, it is recommended to convert formats like JSON and CSV into more optimized formats like Apache Parquet and Apache ORC. This will improve query performance greatly. This can also be done with Presto using the CREATE TABLE AS command. More information on this here.
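As a sketch of such a conversion using the user table created above (the user_parquet table name is hypothetical):

```sql
CREATE TABLE user_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM user;
```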

Ahana Cloud makes it very easy to query a data lake with Presto. It is a managed service for Presto and also comes with a built-in Hive Metastore so that you don’t need to deploy and manage one. In addition, it can also integrate with AWS Glue. Getting started with Ahana Cloud is easy. Here’s how: https://ahana.io/docs/getting-started

Additional resources: 

Presto Docker Container 
Presto-cli Docker Container 
Presto Sandbox Docker Container 

Why am I getting a Presto EMR S3 timeout error?

If you’re using AWS EMR Presto, you can use the S3 select pushdown feature to push down compute operations (i.e. SELECT) and predicate operations (i.e. WHERE) to S3. Pushdown makes query performance much faster because it means queries will only retrieve required data from S3. It also helps in reducing the amount of data transferred between EMR Presto and S3.

If you’re using pushdown for EMR Presto and seeing a timeout error, there might be a few reasons for that. Because Presto uses EMRFS as its file system, there’s a maximum allowable number of client connections to S3 through EMRFS for Presto (500). When using S3 Select Pushdown, you bypass EMRFS when accessing S3 for predicate operations, so the value of hive.s3select-pushdown.max-connections determines the maximum number of client connections allowed by worker nodes for those operations. Requests that aren’t pushed down use the value of fs.s3.maxConnections.

At this point you might get an error that says “timeout waiting for connection from pool”. That’s because one or both of those connection limits has been exhausted; increasing the two values above should solve the problem.
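For reference, the two settings look like this; the 5000 value is purely illustrative and should be sized to your workload:

```
# In Presto's hive.properties:
hive.s3select-pushdown.max-connections=5000

# In emrfs-site.xml (shown here in properties form for brevity):
fs.s3.maxConnections=5000
```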

Errors like these are common with Presto EMR. EMR is complex and resource-intensive, and there’s a lot you have to understand when it comes to the specific config and tuning parameters for Hadoop. Many companies have switched from EMR Presto to Ahana Cloud, a managed service for Presto on AWS that is much easier to use. Ahana Cloud is a non-Hadoop deployment in a fully managed environment. Users see up to 23x performance with Ahana’s built-in caching.

Check out some of the differences between Presto EMR and Ahana Cloud. If you’re using EMR Presto today, Ahana Cloud might help with some of those pain points. Additionally, Ahana is pay-as-you-go pricing and it’s easy to get started if you’re already an EMR user. 

Do I need to move my data to query it with Presto?

No, Presto queries your data in-place, so you don’t need to move it. If you’re using AWS S3 for your data lake, for example, you wouldn’t need to ingest the data before querying it, as you would with a data warehouse like AWS Redshift. 

To bring Presto compute to your data, you can leverage Ahana Cloud.

With Ahana Cloud, it’s very easy to leverage the power of Presto to query AWS S3. You just connect your data source to Ahana and everything continues to run in your cloud account (called in-vpc). It’s just a click of a button to add data sources to Ahana Cloud for querying. Ahana Cloud would be a replacement for Amazon Athena, EMR Presto, or if you’re running Presto on your own in AWS. It’s a managed service for Presto that takes care of all the configuration, tuning, deployment, managing, attaching/detaching data sources, etc.
You can learn more about Ahana Cloud, and you can also sign up to trial it too.

How do I sync my partition and metastore in Presto?

Sync partition metadata is used to sync the metastore with information on the file system/S3 for the external table. Depending on the number of partitions, the sync can take some time.

Here is a quick reference from the presto docs: https://prestodb.io/docs/current/connector/hive.html?highlight=sync_partition_metadata


  • system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)
    Create an empty partition in the specified table.
  • system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)
    Check and update partitions list in metastore. There are three modes available:
    • ADD : add any partitions that exist on the file system but not in the metastore.
    • DROP: drop any partitions that exist in the metastore but not on the file system.
    • FULL: perform both ADD and DROP.

The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (e.g. col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
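As a quick sketch, assuming a Hive catalog named hive and a table web.events whose partition directories were written directly to S3 (both names are hypothetical), the metastore can be brought back in sync with:

```sql
-- ADD partitions found on S3 but missing from the metastore,
-- and DROP metastore partitions whose directories are gone (mode FULL).
CALL hive.system.sync_partition_metadata('web', 'events', 'FULL');
```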

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

How do I run a CTAS (Create Table As) with a Query?

There are a few different ways to run a CTAS with a Query in Presto. Below we’ll lay those out.

OPTION 1: Managed Schema
Managed Schema Creation:
CREATE SCHEMA managed_schema WITH (location = 's3a://mybucket/managed_schema/');

Create Table AS with Managed Schema:

CREATE TABLE managed_movies_orc
COMMENT 'Managed Movies ORC'
WITH (format = 'ORC')
AS SELECT * FROM "ahana-demo".ahana_s3_movies;

OPTION 2: External Table 
Technically you can’t do a CREATE TABLE AS with an external table; you will get the error:

presto error: External tables cannot be created using CREATE TABLE AS

You can, however, do a CREATE TABLE followed by an INSERT INTO once hive.non-managed-table-writes-enabled is set to true:

CREATE TABLE <your schema>.ext_table_movies_orc (movieid bigint, title varchar, genres varchar)
WITH (format = 'ORC',
      external_location = 's3a://mybucket/external_table/');

INSERT INTO <your schema>.ext_table_movies_orc SELECT * FROM "ahana-demo".ahana_s3_movies;
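Note that hive.non-managed-table-writes-enabled is a catalog configuration property, not a session property; as a sketch, it would typically go in the Hive catalog properties file (the path below assumes a standard Presto layout):

```properties
# etc/catalog/hive.properties
hive.non-managed-table-writes-enabled=true
```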

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

What is the difference between a managed table and external tables?

The main difference between a managed and external table is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped.

In Presto:
External Table: a table created with a WITH clause that specifies ‘external_location’
Managed Table: a table created in a schema whose WITH clause specifies ‘location’

You cannot “insert into” an external table by default (the setting hive.non-managed-table-writes-enabled=false prevents you from doing so). 

The expectation is that the data in an external table is managed externally, e.g. by Spark, Hadoop, Python scripts, or another external ETL process.

Below are the major differences between Internal vs External tables in Apache Hive.

  • By default, Hive creates an internal (managed) table; use the EXTERNAL option/clause to create an external table.
  • For an internal table, Hive owns the metadata and the table data, managing the full lifecycle of the table; for an external table, Hive manages the metadata but not the underlying files.
  • Dropping an internal table drops the metadata from the Hive Metastore and the files from HDFS; dropping an external table drops just the metadata from the Metastore without touching the actual files on HDFS/S3.
  • For internal tables, metadata for inserts, creation of new partitions, etc. is updated automatically through the metastore; for external tables, you need to explicitly run sync_partition_metadata to sync changes on S3 with the metastore.

In short, use managed tables when the metastore should manage the lifecycle of the table, or when generating temporary tables. Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
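To illustrate the difference in DDL (the schema, table, and bucket names below are hypothetical), only the external table declares an external_location:

```sql
-- Managed: data lives under the schema's 'location'; DROP TABLE deletes the files.
CREATE TABLE my_schema.managed_events (id bigint, name varchar)
WITH (format = 'ORC');

-- External: data stays at 'external_location'; DROP TABLE leaves the files intact.
CREATE TABLE my_schema.external_events (id bigint, name varchar)
WITH (format = 'ORC',
      external_location = 's3a://mybucket/events/');
```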

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.

What is Presto and what are its frequently asked questions (FAQ)?

What is Presto?

Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratchpad for a long calculus problem.

Why is Presto so fast?

In the past, distributed query engines like Hive were designed to persist intermediate results to disk. As the below figure illustrates, Presto saves time by executing the queries in the memory of the worker machines, including performing operations on intermediate datasets there, instead of persisting them to disk. The data can reside in HDFS or any database or any data lake, and Presto performs the executions in-memory across your workers, shuffling data between workers as needed. Avoiding the need for writing and reading from disk between stages ultimately speeds up the query execution time. Hive intermediate data sets are persisted to disk. Presto executes tasks in-memory.

If this distributed in-memory model sounds familiar, that’s because Apache Spark uses the same basic concept to effectively replace MapReduce-based technologies. However, Spark and Presto manage stages differently. In Spark, data needs to be fully processed before passing to the next stage. Presto uses a pipeline processing approach and doesn’t need to wait for an entire stage to finish.
Presto was developed with the following design considerations:
• High performance with in-memory execution
• High scalability from 1 to 1000s of workers
• Flexibility to support a wide range of SQL use cases
• Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
• Federation of data sources via Presto connectors
• Seamless integration with existing SQL systems by adhering to the ANSI SQL standard

Portions of this answer are excerpted from the O’Reilly book “Learning and Operating Presto,” get your free digital copy from Ahana: https://ahana.io/ebook/

What is Presto used for?

Presto is used as the distributed query engine as part of an analytics platform. The presto engine is used for many different types of analytical workloads to address a variety of problems. Leveraging a separate storage and compute model, also known as “disaggregated,” Presto can query one or more data stores, metastores, and SQL tools.

What are common Presto use cases?

1. Ad hoc querying
Presto was originally designed for interactive analytics, or ad hoc querying. That means results are returned in a matter of a few seconds so an analyst can drill-down and ask questions of large amounts of data interactively. In today’s “Internet era” competitive world and with data systems able to collect granular amounts of data in near-real-time, engineers, analysts, data scientists, and product managers all want the ability to quickly analyze their data and become data-driven organizations that make superior decisions, innovating quickly and improving their businesses.

They either simply type in simple queries by hand or use a range of visualization, dashboarding, and BI tools. Depending on the tools chosen, they can run 10s of complex concurrent queries against a Presto cluster. With Presto connectors and their in-place execution, platform teams can quickly provide access to the data sets that users want. Not only do analysts get access, but they can run queries in seconds and minutes–instead of hours–with the power of Presto, and they can iterate quickly on innovative hypotheses with the interactive exploration of any data set, residing anywhere.

2. Reporting and dashboarding
Because of the design and architecture of Presto and its ability to query across multiple sources, Presto is a great backend for reporting and dashboarding. Unlike the first-generation static reporting and dashboarding, today’s interactive reporting and dashboards are very different. Analysts, data scientists, product managers, marketers and other users not only want to look at KPI’s, product statistics, telemetry data and other data, but they also want to drill down into specific areas of interest or areas where opportunity may lie. This requires the backend – the underlying system – to be able to process data fast wherever it may sit. To support this type of self-service analytics, platform teams are required to either consolidate data into one system via expensive pipelining approaches or test and support every reporting tool with every database, data lake and data system their end users want to access. Presto gives data scientists, analysts and other users the ability to query data across sources on their own so they’re not dependent on data platform engineers. It also greatly simplifies the task of the data platform engineers by absorbing the integration testing and allowing them to have a single abstraction and end point for a range of reporting and dashboarding tools.

3. ETL using SQL
Analysts can aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto. Instead of legacy batch processing systems, Presto can be used to run resource-efficient and high throughput queries. ETL can process all the data in the warehouse; it generates tables that are used for interactive analysis or feeding various downstream products and systems.

Presto as an engine is not an end-to-end ETL system, nor is Hive or Spark. Some additional tools can be easily added to coordinate and manage numerous on-going time-based jobs, a.k.a. cron jobs, which take data from one system and move it into another store, usually with a columnar format. Users can use a workflow management system like open source Apache Airflow or Azkaban. These automate tasks that would normally have to be run manually by a data engineer. Airflow is an open source project that programmatically authors, schedules and monitors ETL workflows, and was built by Airbnb employees who were former Facebook employees. Azkaban, another open source project, is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs.

The queries in batch ETL jobs are much more expensive in terms of data volume and CPU than interactive jobs. As such the clusters tend to be much bigger. So some companies will separate Presto clusters: one for ETL and another one for ad hoc queries. This is operationally advantageous since it is the same Presto technology and requires the same skills. For the former, it’s much more important that the throughput of the entire system is good versus latency for an individual query.

4. Data Lake Analytics
Data lakes have grown in popularity along with the rise of Amazon S3-compatible object storage which Amazon AWS has made popular. A data lake enables you to store all your structured and unstructured data as-is and run different types of analytics on it.

A data warehouse is optimized to analyze relational data, typically from transactional systems and business applications, where the data is cleaned, enriched, and transformed. A data lake is different from a data warehouse in that it can store all your data–the relational data as well as non-relational data from a variety of sources, such as from mobile apps, social media, time-series data–and you can derive more new insights from the analysis of that broader data set. Again you can do so without necessarily needing to process and copy that data beforehand.

Presto is used to query data directly on a data lake without the need for additional ingestion or copies. You can query any type of data in your data lake, including both structured and unstructured data. As companies become more data-driven and need to make faster, more informed decisions, the need for analytics on an increasingly larger amount of data has become a higher priority in order to do business.

5. Real-time analytics with real-time databases
Real-time analytics is becoming increasingly used in conjunction with consumer- facing websites and services. This usually involves combining data that is being captured in real time with historical or archived data. Imagine if an e-commerce site had a history of your activity archived in an object store like S3, but your current session activity is getting written to a real-time database like Apache Pinot. Your current session activity may not make it into S3 for hours until the next snapshot. By using Presto to unite data across both systems, that website could provide you with real- time incentives so you don’t abandon your cart, or it could determine if there’s possible fraud happening earlier and with greater accuracy.

Portions of this article are excerpted from the O’Reilly book “Learning and Operating Presto,” get your free digital copy from Ahana: https://ahana.io/ebook/. This includes a chapter of Presto with Pinot, a real-time database.

What is Apache Presto?

Apache Presto is a misnomer in that Presto is not a project hosted under the Apache Software Foundation (ASF). ASF incubator and top-level projects are subject to the naming convention “Apache [Project Name]”; one example is Apache Cassandra. Instead, the open source Presto project is hosted by another well-known project hosting body, The Linux Foundation, which creates sub-foundations; Presto is specifically under the Presto Foundation. This is similar to how Kubernetes is a project of the Cloud Native Computing Foundation, which is a sub-foundation of The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which Presto is licensed under.

Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan on a cluster of worker machines which are connected to data sources, and then returns the query results. The query plan may have a number of execution stages depending on the query. For example, if your query is joining together many large tables, it may need multiple stages to execute, aggregating tables together. After each execution stage there may be intermediate data sets. You can think of those intermediate answers like your scratchpad for a long calculus problem.

What Is Trino & FAQs

What is Trino?

Trino is an Apache 2.0-licensed, distributed SQL query engine that was forked from the original Presto project, whose GitHub repo was called PrestoDB. As such, it was designed from the ground up for fast queries against any amount of data. It supports many types of data sources, including relational and non-relational sources, via its connector architecture.

What is the history of Trino?

Trino is a hard fork of the widely popular open source Presto project, which started out at Facebook, running large-scale interactive analytic queries against a 300PB data lake using Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive. In November 2013, Facebook open sourced Presto under the Apache 2.0 license and made it available in the public GitHub code repository named “prestodb”. In early 2019, the hard fork named Trino was started by the creators of Presto, who later became cofounders/CTOs of the commercial vendor Starburst. In the meantime, Presto became part of the openly governed Presto Foundation, hosted under the guidance and experience of The Linux Foundation. Trino has subsequently diverged from Presto: many of the innovations the community is driving in Presto are not available in Trino. Today it is Presto, not Trino, that is running at companies like Facebook, Uber, Twitter, and Alibaba.

Why is Trino so fast?

As Trino is a hard fork of the original Presto project, it carries with it some of the original elements which make Presto so fast, namely the in-memory execution architecture. Prior to Presto, distributed query engines such as Hive were designed to store intermediate results to disk.

How does Trino work?

Trino is a distributed system that runs on Hadoop and uses an architecture similar to massively parallel processing (MPP) databases. It has one coordinator node working with multiple worker nodes. Users submit SQL to the coordinator, which uses its query and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It supports standard ANSI SQL, including complex queries, joins, aggregations, and outer joins.

What is Apache Trino?

Apache Trino is a misnomer in that Trino is not a project hosted under the well-known Apache Software Foundation (ASF). Apache incubator and top-level projects are subject to the naming convention “Apache [Project Name]”; an example is Apache Mesos. Instead, the Trino project, which is a hard fork of Presto, is hosted by a vendor-controlled non-profit called the Trino Software Foundation. It is not affiliated with any well-known project hosting organization like the ASF or The Linux Foundation. The misnomer may have arisen from the fact that most open source projects use the Apache 2.0 license, which Trino is licensed under.

Is Trino OLAP?

Trino is an open source distributed SQL query engine. It is a hard fork of the original Presto project created by Facebook. It lets developers run interactive analytics against large volumes of data. With Trino, organizations can easily use their existing SQL skills to query data without having to learn new complex languages. The Trino architecture is quite similar to traditional online analytical processing (OLAP) systems using distributed computing architectures, in which one controller node coordinates multiple worker nodes.

What is the Trino Software Foundation?

The Trino Software Foundation is a non-profit corporation controlled by the cofounders of the commercial vendor Starburst. The Trino Software Foundation hosts the open source Trino project, a hard fork of the Presto project, which is separate and hosted by The Linux Foundation. The Trino website offers only two sentences about the foundation: “The Trino Software Foundation (formerly Presto Software Foundation) is an independent, non-profit organization with the mission of supporting a community of passionate users and developers devoted to the advancement of the Trino distributed SQL query engine for big data. It is dedicated to preserving the vision of high quality, performant, and dependable software.” What is not mentioned is any form of charter or governance. These are table stakes for Linux Foundation projects, where project governance is central to the project.

What SQL does Trino use?

Trino, like the original Presto, is built with a familiar SQL query interface that allows interactive SQL on many data sources. Standard ANSI SQL semantics are supported, including complex queries, joins, and aggregations.

What Is A Trino database?

Trino’s distributed system runs on Hadoop/HDFS and other data sources. It uses a classic MPP (massively parallel processing) model. The Java-based system has a coordinator node (master) working in conjunction with a scalable set of worker nodes. Users send their SQL query through a client to the Trino coordinator, which plans and schedules a distributed query plan across all its worker nodes. Trino and Presto are SQL query engines and thus are not databases by themselves. They do not store any data, but from a user perspective, Trino can appear as a database because it queries the connected data stores.

What is the difference between Presto and Trino?

There are technical innovations and differences between Presto and Trino that include:
– Presto is developed, tested, and runs at scale at Facebook, Uber, and Twitter
– Presto uses 6X less memory and repartitions 2X faster with project Aria
– “Presto on Spark” today can run massive batch ETL jobs.
– Presto today is 10X faster with project RaptorX, providing caching at multiple levels
– The Presto community is making Presto more reliable and scalable with multiple coordinators instead of the single point of failure of one coordinator node.  

Trino can query data where it is stored, without needing to move it into a separate warehouse or analytics database. Queries are executed in parallel in the memory of distributed worker machines, and most results return in seconds. Whereas Trino is a newer fork, Presto continues to be used by many well-known companies: Facebook, Uber, Twitter, and AWS. Trino is a vendor-driven project, as it is hosted by a non-profit organization owned by the cofounders of the Trino vendor Starburst. In comparison, Presto is hosted by the Presto Foundation, a sub-foundation under The Linux Foundation. Multiple vendors support Presto, including the Presto as a Service (SaaS) offerings Ahana Cloud for Presto and AWS Athena, which is based on Presto, not Trino.

As the diagram below illustrates, Presto saves time by running queries in the memory of the worker machines, performing operations on intermediate datasets in-memory instead of persisting them to disk, which is much faster. It also shuffles data among the workers as needed, which obviates writes to disk between the stages. Hive intermediate data sets are persisted to disk; Presto executes tasks in-memory.

Whereas the pipelining approach is shared between Presto and Trino, Presto has a number of performance innovations that are not in Trino, such as caching. For more about the differences, see the April 2021 talk by Facebook at PrestoCon Day, which describes what they, along with others like Ahana, are doing to push the technology forward.

Trino is a distributed SQL query engine best used for running interactive analytic workloads on your data lakes and data sources. It is used for similar use cases to those the original Presto project was designed for. It allows you to query many different data sources, whether it’s HDFS, Postgres, MySQL, Elasticsearch, or an S3-based data lake. Trino is built in Java and can also integrate with other third-party data sources or infrastructure components. 

After the query is parsed, Trino processes the workload into multiple stages across workers. Computing is done in-memory with staged pipelines.

To make Trino extensible to any data source, it was designed with a storage abstraction that makes it easy to build pluggable connectors. Because of this, Trino has a lot of connectors, including to non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational sources such as MySQL, PostgreSQL, Amazon Redshift, and Microsoft SQL Server. Like the original community-driven open source Presto project, the data is queried where it is stored, without the need to move it into a separate analytics system.  

Want to hear more about Ahana, the easiest Presto managed service ever made? Learn more about Ahana Cloud.

Price-Performance Ratio of AWS Athena Presto vs Ahana Cloud for Presto

Both AWS Athena and Ahana Cloud are based on the popular open-source Presto project which was originally developed by Facebook and later donated to the Linux Foundation’s Presto Foundation. There are a handful of popular services that use Presto, including both AWS Athena and Ahana Cloud. 

The biggest difference between the two is that AWS Athena is a serverless architecture while Ahana Cloud is a managed service for Presto servers. The next biggest difference is the pricing model: with AWS Athena you pay for the amount of data scanned, while Ahana Cloud is priced by the amount of compute used. This can make a huge difference in price-performance. Before we get into price-performance specifically, here’s an overview of the comparison:

AWS Athena (serverless Presto) vs. Ahana Cloud for Presto (managed service):

  • Cost dimension — Athena: pay for the amount of data scanned on a per-query basis, at USD $5 per terabyte scanned; it may be hard to estimate how much data your queries will scan. Ahana: pay only for EC2 usage on a per-node/hour basis for EC2 and Ahana.
  • Cost effectiveness — Athena: only pay while the query is scanning, not for idle time. Ahana: only pay for EC2 and Ahana Cloud while compute resources are running, plus ~$4 per day for the managed service.
  • Scale — Athena: can scale query workloads but has concurrency limits. Ahana: can easily scale query workloads without concurrency limits.
  • Operational overhead — Athena: lowest operational overhead; no need to patch the OS, AWS handles that. Ahana: low operational overhead; no need to patch the OS, Ahana Cloud handles that and the operation of the servers.
  • Update frequency — Athena: infrequent updates to the platform; not current with PrestoDB, over 60 releases behind. Ahana: frequent updates to the platform; typically, Presto on Ahana Cloud is upgraded quarterly to keep up with the most recent releases.

Both let you focus on deriving insight from your analytical queries, as you can leave the heavy lifting of managing the infrastructure to AWS and the Ahana Cloud managed service. 

How do you define price-performance ratio?

Price–performance ratio
From Wikipedia, the free encyclopedia
In engineering, the price–performance ratio refers to a product’s ability to deliver performance, of any sort, for its price. Generally, products with a lower price/performance ratio are more desirable, excluding other factors.

Comparing the Price-Performance Ratio of Amazon Athena vs. Ahana Cloud

For this comparison, we’ll look at performance in terms of the amount of wall-clock time it takes for a set of concurrent queries to finish. The price is the total cost of running those queries. 

Instead of using a synthetic benchmark, we’ll look at the public case study on the real-world workloads from Carbon, who used Athena and then switched to Ahana Cloud. While your workloads will be different, you’ll see why the price-performance ratio is likely many times better with Ahana Cloud. And by going through an example, you’ll be able to apply the same method when doing a quick trial (we’re here to help, too).

Here’s a few things that the Carbon public case study showed:

  • While you cannot tell how many or what type of EC2 instances are used by Athena V2, they determined that they could get similar performance with 10 c5.xlarge workers in Ahana Cloud
  • Athena V2 would start to queue queries after there were 2 other queries running, meaning that the amount of wall-clock time was extended as a result.

AWS Athena is constrained by AWS concurrency limits

Ahana has higher concurrency so queries finish faster

  • The queries would be charged at a rate of $5/TB scanned regardless of the amount of compute used. Their 7 tests ended up scanning X TBs = $Y 
  • Ahana Cloud with 10 X c5.xlarge workers has total costs of:
  • Presto Worker: c5.xlarge at 17 cents/hour × 10 nodes = $1.70
  • Presto Coordinator: c5.xlarge at 17 cents/hour × 1 node = $0.17
  • Ahana Cloud: 10 cents/hour × 11 nodes = $1.10
  • Ahana Cloud Managed Service: 8 cents/hour × 1 = $0.08
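A quick sanity check of the hourly total, using the per-hour rates from the line items above (rates are illustrative and may vary by region):

```python
# Hourly Ahana Cloud cost: (node count, $/hour rate) per line item.
line_items = {
    "presto_workers": (10, 0.17),     # 10 c5.xlarge workers
    "presto_coordinator": (1, 0.17),  # 1 c5.xlarge coordinator
    "ahana_cloud": (11, 0.10),        # Ahana Cloud charge per node
    "managed_service": (1, 0.08),     # flat managed-service charge
}

total_per_hour = sum(count * rate for count, rate in line_items.values())
print(f"${total_per_hour:.2f}")  # → $3.05
```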

So, you can run many queries for one hour that scan any amount of data for only about $3.05, compared to a single Athena query scanning one TB of data costing $5.00.


While there is value in the simplicity of AWS Athena’s serverless approach, there are trade-offs around price-performance. Ahana Cloud can help.

Ahana is an easy cloud-native managed service with pay-as-you-go-pricing for your PrestoDB deployment. 

Get started with a 14-day free trial today.

What are the AWS Glue partition limits and does it apply to AWS Athena?

Typically you’ll use AWS Glue to create the data sources and tables that Athena will query. You could also create your tables using Athena and then use those tables in AWS Glue. Today the limit for AWS Glue partitions is 10M, while Athena’s native partition limit is 20K partitions per table. So by using the AWS Glue catalog with your Athena deployment, you get far more partitions. 

When it comes to using Amazon Athena, there are many other limits besides partitions, including query, concurrent query, and database limits. AWS Athena users often find that these limits hinder SLAs and other business-critical applications. As a result, they’re looking for alternatives to AWS Athena and have found that Ahana Cloud’s Presto managed service can address those issues.

Ahana Cloud is a Presto managed service that gives users the power of Presto, similar to AWS Athena, but without the limitations around queries, performance, and control. With Ahana you get the full control of your Presto deployment on AWS, so you can scale as you need without having to worry about partition or query limits.

Learn more about how Ahana Cloud compares to AWS Athena.

Ready to get started? Sign up to try out Ahana Cloud for free. If you’re an AWS Athena user, it’s really easy to move from Athena to Ahana Cloud, and it requires no change to your existing architecture.

You can also check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

What level of concurrency performance can I expect using Presto as part of the AWS Athena service?

I’m getting a lot of my workloads queued up when I use AWS Athena. What are my other options?


Concurrency is an important factor when it comes to submitting many query workloads. Many data platform teams depend on being able to handle multiple requests from multiple users at the same time. In AWS Athena you can submit queries, but they won’t always be running; the number of concurrently running queries determines your overall throughput. As you may know, AWS Athena is serverless, so you don’t see much behind the curtain. Depending on the load on the shared service (shared with other customers), you’ll see different operating characteristics, leading to non-deterministic performance. 

When your number of running queries hits a limit, any additional queries are put into a queue to wait for running queries to finish. We’ve heard that queuing can occur with as few as 3 to 5 concurrently running queries. This can make it difficult to provide consistent levels of performance to your users. 

But I thought you could increase the service limits to address the queuing behavior?

Yes and no. AWS Athena has soft limits on the number of active queries. But it turns out that raising those limits doesn’t change the number of concurrently running queries; instead, it seems to allow you to submit more requests without getting the “TooManyRequest” exception, and there may be more queries sitting in a RUNNING/0KB state, meaning they’re queued.

Other alternatives

If you need consistent performance across a large number of queries, another approach is to run your own Presto service. AWS Athena is a serverless version of Presto but, as mentioned, has limits on performance. In addition, Athena’s pay-per-TB-scanned model can be cost-prohibitive. There are two options for running your own query service:

  1. Operating your own Presto clusters with AWS EMR or by yourself using AMIs
  2. Using a managed service like Ahana Cloud for Presto

On #2, Ahana Cloud for Presto provides you with a scalable amount of concurrent query performance without the operational burden of managing your own clusters. Because Presto is a distributed SQL query engine, adding more Presto workers increases the number of queries that can run at the same time.

To compare the performance of a set of query workloads, it’s always best to look at the overall throughput, which means including the wait time associated with queued workloads in Athena. It is for this reason that running your own Presto service will always allow you to have higher performance with more flexibility than AWS Athena. 

There are price-performance considerations too: once you take into account the wait time associated with queued workloads, you’ll see a much better price-performance ratio from your own Presto service than from AWS Athena.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

How do I get deterministic performance out of Amazon Athena?

What is Athena?

Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage. This approach is advantageous for interactive querying on datasets already residing in S3 without the need to move the data into another analytics database or a cloud data warehouse.

What is Athena great for?

Users love Athena for its simplicity, the ease of getting started, and the fact that no servers need to be managed. You only pay for the amount of data scanned. Athena is closely integrated with AWS Glue, so if you are already using the Glue catalog for your ETL workloads, all your tables are already defined and accessible. You can then make these tables available for interactive analytics to your extended team.

What are the trade-offs?

The simplicity of the deployment architecture of Amazon Athena, however, comes at a price of inconsistent performance with scale, as many users of Athena have already experienced.

There are two primary trade-offs with Athena. Firstly, transparency – “you get what you get” – you have no visibility to the underlying infrastructure serving your queries nor do you have the ability to control or tweak that infrastructure for your workloads. The approach works for cases where the performance/latency is not critical, but it can be a non-starter for users who need control. Secondly, despite all its advantages, shared services serverless models have disadvantages related to performance predictability.

AWS has documented best practices in their performance tuning guide to get the most out of Athena and to avoid typical errors that users encounter such as “Query exceeded local memory limit” and “Query exhausted resources at this scale factor”. These practices include using optimized formats such as Parquet and ORC, avoiding small files, and partitioning the data appropriately.
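
As an example of one such practice, raw CSV data can be rewritten as partitioned Parquet with a CTAS statement. The sketch below uses hypothetical table and column names:

```sql
-- Hypothetical tables: rewrite raw CSV data as partitioned Parquet.
-- In Athena CTAS, the partition column(s) must come last in the SELECT list.
CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['event_date']
) AS
SELECT user_id, action, event_date
FROM events_csv;
```

Queries against the Parquet table then scan only the columns and partitions they need, which reduces both latency and the per-TB-scanned cost.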

How does the serverless architecture impact my query performance?

Athena at its core is a shared serverless service per region. To prevent usage from spiraling out of control at the hands of a handful of customers, Amazon has placed restrictions on the usage, size, concurrency, and scale of the service on a per-customer basis and within the region overall.

These limits and restrictions are well documented in the service limits, such as 20 active DDL and 20 active DML queries in most regions, 30-minute maximum query timeouts, and API limits and throttling, among others (some of these limits can be raised by working with Amazon support). These limits are guard rails around the service so that the usage of one customer doesn’t adversely affect the experience of another. These guardrails are, however, far from perfect, since there are only a finite number of resources per region to be shared across customers. Any excessive load due to seasonal or unexpected spikes at Amazon’s scale can easily consume the shared resources, causing contention and queuing.

In addition to the shared query infrastructure, Athena’s federated query connectors are based on Lambda, which is again serverless. Lambda scales out well and can be performant once warmed up, but consistent performance comes only with consistent use. Depending on the usage of a particular connector in the region and the available capacity of the backend infrastructure, you could run into latencies caused by cold starts, especially if you are using connectors that are not accessed frequently, e.g. custom connectors.

If a large number of users end up using Athena at the same time, especially for large-scale queries, they often observe extended queuing of their queries. Though the eventual query execution might not take long once resources are available, the queuing significantly impacts the end-user experience for interactive query workloads. Users have also at times reported inconsistent execution times for the same query from one run to another, which ties back to the shortcomings of the shared-resources model.

So can I get deterministic performance out of Athena? 

If your queries are not complex or latency-sensitive and your infrastructure is in a less crowded region, you may not encounter performance predictability issues frequently. However, your mileage depends entirely on several factors: when you are running the query, which region you are running it in, the volume of data you are accessing, and your account’s service-limit configuration, to name a few.

What are my options?

If your interactive query workload is latency-sensitive and you want to deterministically control the performance of your queries and the experience of your end-users, you need dedicated managed infrastructure. A managed Presto service gives you the best of both worlds – It abstracts the complexity of managing a distributed query service at the same time giving you the knobs to tweak the service to your workload needs.

Ahana provides a managed Presto service that can scale up and down depending on your performance needs. You can segregate workloads into different clusters or choose to share the cluster. You can also choose beefier infrastructure for more business and time-critical workloads and also set up separate clusters for less critical needs. You make that choice of price, performance, and flexibility depending upon business objectives.

Do I have to use AWS Lambda to connect to data sources with Athena?

The Athena Federated Query Journey

AWS announced the public preview of Athena federated query in November 2019 and moved the feature to General Availability (GA) in November 2020. As of this writing, AWS supports two versions of the Athena engine – engine 1, based on Presto 0.172, and engine 2, based on Presto 0.217. Among other features, federated query functionality is only supported on engine 2 of Athena.

What connectors are supported?

Athena currently supports a number of data source connectors:

  • Amazon Athena CloudWatch Connector
  • Amazon Athena CloudWatch Metrics Connector
  • Athena AWS CMDB Connector
  • Amazon Athena DocumentDB Connector
  • Amazon Athena DynamoDB Connector
  • Amazon Athena Elasticsearch Connector
  • Amazon Athena HBase Connector
  • Amazon Athena Connector for JDBC-Compliant Data Sources (PostgreSQL, MySQL, and Amazon Redshift)
  • Amazon Athena Neptune Connector
  • Amazon Athena Redis Connector
  • Amazon Athena Timestream Connector
  • Amazon Athena TPC Benchmark DS (TPC-DS) Connector

In line with the serverless model of AWS Athena, and to maintain an impedance match between Athena and the connectors, the connectors are also serverless, based on AWS Lambda, and run within the region. These connectors are open-sourced under the Apache 2.0 license and are available on GitHub, or they can be accessed from the AWS Serverless Application Repository. Third-party connectors are also available in the Serverless Application Repository.

Athena is based on Presto, so can I use the open-source Presto data source connectors with Athena?

Presto has an impressive set of connectors right out of the box; however, these connectors cannot be used as-is with Athena. The Presto service provider interface (SPI) required by the Presto connectors is different from AWS Athena’s Lambda-based implementation, which is based on the Athena Query Federation SDK.

You can use the Athena Query Federation SDK to write your own connector using Lambda or to customize one of the prebuilt connectors that Amazon Athena provides and maintains. The Athena connectors use the Apache Arrow format for returning data requested in a query. An example Athena connector can be found here.

If you’re looking for a managed service approach for PrestoDB, Ahana Cloud offers that based on open source Presto. Ahana’s managed deployment is 100% compatible with the open source Presto connectors, and you can get started with a click of a button.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

How do I do geospatial queries and spatial joins in Presto?

A question that often comes up is “how do I do geospatial queries and spatial joins in Presto?”. Fortunately Presto supports a wealth of functions and geospatial-specific joins to get the job done.

Let’s get started with a step-by-step tutorial. First we’ll set up some test data in two tables. The first table is trips_table, which will store 4 rows, each representing a simple taxi trip. We store a trip id, the origin and destination long/lat coordinates, and the journey start time:

create table trips_table (
trip_id int, 
orig_long double, 
orig_lat double, 
dest_long double, 
dest_lat double, 
trip_start timestamp);

insert into trips_table values (1, 51.50953, -0.13467, 51.503041, -0.117648, cast('2021-03-02 09:00:00 UTC' as timestamp));
insert into trips_table values (2, 34.039874, -118.258663, 34.044806, -118.235187, cast('2021-03-02 09:30:00 UTC' as timestamp));
insert into trips_table values (3, 48.858965, 2.293497,48.859952, 2.340328, cast('2021-03-02 09:45:00 UTC' as timestamp));
insert into trips_table values (4, 51.505120, -0.089522, 51.472602, -0.489912, cast('2021-03-02 10:45:00 UTC' as timestamp));

For information:

  • Trip 1 is a ride within central London, from Piccadilly Circus to the London Eye.
  • Trip 2 is a ride in downtown Los Angeles
  • Trip 3 is a ride from the Eiffel Tower to Musée du Louvre, Paris 
  • Trip 4 is a ride from Borough Market in central London to Heathrow Airport Terminal 5 (outside central London).

The second table is city_table with each row storing the shape representing an area e.g. central London, and a name for the shape. We represent the shape with a sequence of coordinates that enclose a specific area: 

create table city_table (
geo_shape varchar, 
name varchar);

insert into city_table values ('POLYGON((51.519287 -0.172316,51.519287 -0.084103,51.496393 -0.084103,51.496393 -0.172316,51.519287 -0.172316))', 'London, central');
insert into city_table values('POLYGON((33.9927 -118.3023,33.9902 -118.1794,34.0911 -118.2436,33.9927 -118.3023))', 'Los Angeles, downtown');

For information:

  • I used a simple triangle to represent downtown LA (see illustration below)
  • I used a rectangle representing central London. 
  • In each case the first pair of coordinates for the shape is the same as the last pair – so it’s an enclosed bounding box or polygon we’re describing.
  • We’re storing our shapes as text in a varchar column for simplicity.
  • We describe each polygon as comma-separated pairs of long/lat coordinates using the well-known text POLYGON(()) syntax. The double brackets are required. 
Simple polygon to represent downtown LA 
(credit: “Geospatial Data Analysis using SQL”)

Now let’s run a query to count how many trips occurred in each city. We join our two tables, and we use each journey’s originating long/lat coordinates to determine – using ST_CONTAINS() – if that point exists in any of our shapes.  This function requires the polygon to be expressed as a special type – Geometry – so we convert our shape from text using ST_GeometryFromText() function:

SELECT c.name as City, count(*) as Trips 
FROM trips_table as t 
JOIN city_table as c 
ON ST_Contains(ST_GeometryFromText(c.geo_shape), st_point(t.orig_long, t.orig_lat)) 
GROUP BY c.name;

         City          | Trips 
 Los Angeles, downtown |     1 
 London, central       |     2 
(2 rows)

We see both London trips made it into the result set, despite one of the trips ending at the airport, which is way outside the shape we defined for central London – this is because the query uses the originating coordinates for each trip, not the destination coordinates. 

Also notice the Paris trip didn’t make it into the result – this is because we did not define a shape for Paris.
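
To count trips by destination city instead, the same join pattern works with the destination coordinates – a sketch against the tables above:

```sql
SELECT c.name AS City, count(*) AS Trips
FROM trips_table AS t
JOIN city_table AS c
  ON ST_Contains(ST_GeometryFromText(c.geo_shape),
                 st_point(t.dest_long, t.dest_lat))
GROUP BY c.name;
```

With the sample data, trip 4 would now fall outside the central London rectangle, since Heathrow lies beyond its bounds.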

In this example you’ve seen how Presto’s geospatial functions make spatial joins straightforward with just standard SQL.

Presto’s Geospatial functions are listed in the Presto documentation.  

We have a ton of resources to help you get started with Presto, check them out here.

How do I query JSON documents with Presto?

JSON documents are a common data type. A lot of people collect logs and load them into S3. Querying JSON with Presto can be challenging because you may not always have a consistent schema. However, the great part about Presto is that it has functionality that enables you to get insights on your JSON data.

Presto has a JSON data type. The JSON data type is similar to a string. One way to think about it is as a string containing JSON, except that the data inside the string has already been parsed and is known to be well-formed JSON.

But it is also slightly different semantically, because JSON has maps. In a raw string, the keys of a map can appear in any order, but in the JSON data type the keys of a map are always stored in a consistent order. It’s worth noting that two JSON data type values holding the same map will compare as equal even if the keys originally appeared in different orders. This matters when doing aggregations, grouping, or comparisons: logically, two such maps are treated as the same value, no matter what the original key order was.
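
The same idea can be illustrated outside of Presto with Python’s standard json module – once a map is parsed, the textual key order no longer matters:

```python
import json

# Two JSON strings whose maps list the same keys in different orders.
a = json.loads('{"x": 1, "y": 2}')
b = json.loads('{"y": 2, "x": 1}')

# The parsed values compare equal: key order is irrelevant once parsed.
print(a == b)  # True
```

This is exactly why Presto’s JSON data type, not the raw string, is the right thing to group and compare on.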

How the JSON parse function works

The json_parse function takes a JSON string and returns a value of the JSON data type. Here are some examples:

SELECT json_parse('null'); -- JSON 'null'
SELECT json_parse('true'); -- JSON 'true'
SELECT json_parse('42'); -- JSON '42'
SELECT json_parse('"abc"'); -- JSON '"abc"'
SELECT json_parse('[1, 2, 3]'); -- JSON '[1,2,3]'
SELECT json_parse('{"a": 1, "b": 2}'); -- JSON '{"a":1,"b":2}'
SELECT json_parse('["hey", 42, {"xyz": 100, "abc": false}]');

After a string is parsed with json_parse, the keys in any maps are stored in a consistent order. This is why two strings containing the same map with differently ordered keys parse to the same JSON data type value and compare as equal.

In Presto, all the JSON functions take the JSON data type. If you pass a string into a JSON function, it will be implicitly parsed into JSON.
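
For instance, json_array_contains is documented as taking a JSON argument, but passing a plain string works because the string is implicitly parsed first:

```sql
SELECT json_array_contains('[1, 2, 3]', 2); -- true
```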

Another question that comes up: what is the difference between a JSON null and a SQL NULL?

This can be confusing because JSON has a null and so does SQL, and they are two different things. A SQL NULL is a special marker, and it behaves the way NULL does elsewhere in SQL: most operations on a NULL produce a NULL. A JSON null, by contrast, is just another value, like a number, a string, or an array. The two are totally different, and this has implications when writing Boolean expressions.
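
A quick sketch of the distinction in Presto SQL – the key point being that a JSON null is a real value at the SQL level:

```sql
SELECT json_parse('null') IS NULL;  -- false: a JSON null is a value, not a SQL NULL
SELECT CAST(NULL AS JSON) IS NULL;  -- true: a SQL NULL stays NULL
```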

We have a lot more resources on Presto if you’d like to learn more, check out our Getting Started with Presto page.

Is there latency overhead for Presto queries if everything fits into memory and doesn’t need to be distributed?

Presto is both in-memory and distributed, so each worker has memory and uses it. However, Presto is not an in-memory database. Presto is the query engine and reads from the storage underneath. This is where Presto’s caching capabilities become important.

If all the data required to satisfy a given query arrives from the connected data sources and fits within the memory of the Presto cluster, everything should work accordingly. However, Presto’s MPP, in-memory pipelining architecture will accelerate workloads further in two ways:

1) Presto will ensure data is divided across all the workers to bring maximum processing power and memory capacity to bear.

2) Presto’s generated query plan will ensure execution steps are distributed amongst all the workers such that processing takes place in parallel on the data across all workers simultaneously, as efficiently as possible. Furthermore, with its caching capabilities, Presto can accelerate query execution even further for specific workload patterns, further reducing latency.

If you want to get started with Presto, check out our docs to learn more about the Ahana Cloud managed service for Presto.

Is the Hive metastore a hard dependency of Presto, or could Presto be configured to use something else like Postgres?

With Presto, there’s no hard dependency of having to use the Hive metastore – it’s catalog-agnostic. However, there are significant advantages like better performance when you use the Hive metastore.

A managed service like Ahana Cloud for Presto provides a managed version of a Hive metastore catalog, so you don’t have to deal with the complexity of managing your own metastore. Ahana also supports AWS Glue as a data catalog service.

You don’t necessarily need a metastore if you are connecting via the provided connectors such as MySQL, PostgreSQL, Redshift, Elasticsearch, etc. But if you intend to query data on S3, for instance, you will need a metastore (i.e. Hive or Glue) to define external tables and their formats.

Presto has many connectors including Postgres, which is a supported data source.
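
As a sketch, registering a relational source like Postgres is just a catalog properties file on the Presto coordinator and workers – the connection details below are placeholders:

```
# etc/catalog/postgresql.properties (placeholder values)
connector.name=postgresql
connection-url=jdbc:postgresql://example.net:5432/mydb
connection-user=presto_user
connection-password=secret
```

No metastore is involved here; the connector reads schema information directly from the database.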

The Differences Between Apache Drill vs Presto

Drill is an open source SQL query engine that began life as an open source implementation of the 2010 Google paper “Dremel: Interactive Analysis of Web-Scale Datasets”. Development of Apache Drill began in 2012.

Performance & Concurrency
Drill has never enjoyed wide adoption partially because it was tied to one Hadoop distribution (MapR) but mainly because of inherent performance and concurrency limitations. There are companies that built products based on Drill who report these performance and concurrency issues, and many have migrated away from Drill as a result. Presto’s popularity and adoption, on the other hand, has exploded as numerous companies from SME to web-scale have deployed Presto and contribute to the project. These include Facebook, Uber, Alibaba, Twitter, Netflix, AirBnB and LinkedIn. Users value Presto’s vendor-neutrality and rate of innovation.

Presto connects to external metastores (AWS Glue, Hive Metastore Catalog); many users deploy Presto + AWS Glue/Hive for their data lake analytics. In addition, schemas for relational data sources can be obtained directly from each connector. Drill, on the other hand, performs its own schema discovery and does not need a metastore such as Hive (though it can use one if needed), which can be a benefit.

Compared to Presto, the Apache Drill community has dwindled somewhat especially when it comes to adoption and contributions. We talk to many users who are looking to move from Apache Drill to Presto. If we look at DB engines – https://db-engines.com/en/ranking/relational+dbms – we see that Presto has continued on a positive upward trend and ranks #27, as opposed to Drill which is at #49 in a downward trend.

Overall, Drill is in decline. This was accelerated when HPE (Hewlett Packard Enterprise, who acquired MapR) announced they will no longer support or contribute to Drill (a drawback of having a query engine tied to an ecosystem such as Hadoop). Presto’s popularity continues to increase, as illustrated by 1) the number of commits to the project (In the past month 46 authors have pushed 79 commits to master and 88 commits to all branches, and on master 568 files have changed and there have been 10,930 additions – as of Feb 26 2021), and 2) Presto’s continued rise on DB-engines.

Drill is a top-level Apache Foundation project but does not have a strong and active community behind it. Presto is backed by a strong community and is overseen by the Presto Foundation which is part of the Linux Foundation with 8 premier organizations driving it including Facebook, Uber, Twitter, Intel and Ahana.

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Why am I getting zero records when I use AWS Athena to query a CSV file?

There’s a common error many AWS Athena users see when they query CSV files – they will get zero records back. This tends to happen when they run the AWS Glue crawler and create a new table on the CSV file. There are several reasons why the query might return no records.

  1. The crawler is pointing to a file instead of an AWS S3 bucket
  2. The LOCATION path is wrong or there’s a double slash in it
  3. The partitions haven’t loaded into the Glue Data Catalog or the internal Athena data catalog
  4. In general, CSV crawlers can be sensitive to various issues (e.g. embedded newlines, partially quoted files, blanks in integer fields), so make sure everything is accurate

On the first point, if you have selected a file instead of your S3 bucket, the crawler will succeed but you won’t be able to query the contents, which is why you’ll see the ‘Zero Records Returned’ error message. If there are other files that you don’t want crawled, you’ll need to create a new folder, move your CSV to that new folder, and then update the include path accordingly (and re-crawl).

In order to get your S3 data lake to work, you’ll need to make sure each batch of same-schema files has its own top level directory.
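
As an illustration (bucket and folder names here are hypothetical), a crawler-friendly layout looks like this, with the include path pointing at a folder rather than a file:

```
s3://my-bucket/sales_csv/               <-- crawler include path (a folder, not a file)
s3://my-bucket/sales_csv/2021/01/a.csv
s3://my-bucket/sales_csv/2021/02/b.csv
s3://my-bucket/orders_csv/              <-- files with a different schema get their own top-level folder
```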

If you’re running into issues like this with AWS Athena and want to offload the management while still getting the power of Presto, check out Ahana Cloud. It’s a fully managed service for AWS that removes these types of complexities and makes it really easy to run Presto in the cloud.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Does Presto work natively with GraphQL?

Some users may have a primary data store that is GraphQL-based (AWS AppSync) and want to leverage Presto. For context, GraphQL falls in the application as part of the service mesh infrastructure. It is an abstraction API for mostly operational sources.

Typically for analytics, users move their data into an analytical stack, because the properties and queries of an operational stack and analytical stack are quite different.

Users in this situation typically move the data to S3 to run data lake analytics. Presto is the de facto engine for data lake analytics, and in addition it allows running queries across S3 and operational systems like RDS.

Running analytics on GraphQL / AppSync is not recommended because:

  1. GraphQL isn’t really a query engine with an optimizer, etc.
  2. Analytical queries are fairly compute-intensive and need the right analytical engine to process and execute them. GraphQL is mostly a pass-through, and this will affect your operational systems.

Want to learn more about Presto? Download our free whitepaper: What is Presto?

Why does a single AWS Athena query get stuck in QUEUED state before being executed?

One of the drawbacks of AWS Athena is the fact that as a user, you don’t have control over query performance and predictability. One specific issue that comes up quite a bit for Athena users is single queries getting stuck in a QUEUED state before being executed. This happens because of Athena’s serverless nature in combination with its multi-tenant service – you are essentially competing for resources when you run a query.

Many users will get the message “Query waiting in queue” somewhat randomly. For this reason, it’s recommended not to use Athena for user-facing requests because of this unpredictability, and many users have decided they can’t rely on Athena at all because of it. Since you can’t predict demand or manage the multi-tenancy, it’s not ideal for ad-hoc analytics (which is a core use case for PrestoDB).

For users who still want to leverage PrestoDB for ad-hoc analytics and data lake analytics, Ahana Cloud is a solution that allows you to run Presto transparently as a managed service. You won’t run into the issues around unpredictability and queries getting stuck. There’s a 14 day free trial if you’re interested in checking it out.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

How Presto Joins Data

Because Presto is a distributed system composed of a coordinator and workers, each worker can connect to one or more data sources through corresponding connectors.

The coordinator receives the query from the client and optimises and plans the query execution, breaking it down into constituent parts, to produce the most efficient execution steps. The execution steps are sent to the workers which then use the connectors to submit tasks to the data sources. The tasks could be file reads, or SQL statements, and are optimised for the data source and the way in which the source organises its data, taking into account partitioning and indexing for example.

The data sources supported by Presto are numerous and can be an RDBMS, a noSQL DB, or Parquet/ORC files in an object store like S3 for example. The data sources execute the low level queries by scanning, performing filtering, partition pruning etc. and return the results back to the Presto workers. The Presto join operation (and other processing) is performed by the workers on the received data, consolidated, and the joined result set is returned back to the coordinator.

You will notice Presto uses a “push model” which is different, for example, to Hive’s “pull model”. Presto pushes execution steps to the data sources, so some processing happens at the source, and some happens in Presto’s workers. The workers also communicate between each other, and the processing takes place in memory which makes it very efficient, suitable for interactive queries.  Hive on the other hand will read/pull a block of a data file, execute tasks, then wait for the next block, using the map reduce framework. Hive’s approach is not suitable for interactive queries since it is reading raw data from disk and storing intermediate data to disk, all using the framework MapReduce, which is better suited to long-running batch processing. This diagram compares Hive and Presto’s execution approaches:

How Presto Joins Relational and Non-Relational Sources

The next diagram shows some of Presto’s core coordinator components and the kinds of tasks Presto’s workers handle. In this simplified example there are two data sources being accessed: one worker is scanning a Hive data source, the other is scanning a MongoDB data source. Remember that Presto does not use Hive’s MapReduce query engine or HQL – the diagram’s “hive” worker means it is using the Hive connector, with the metastore supplying the table metadata, while the raw source data is external to Presto, perhaps in HDFS in Parquet or ORC format. The worker dealing with the MongoDB data is on “the probe side” (in this example): the MongoDB data is read, processed, normalized into columnar form in Presto, and then shuffled (or exchanged) across the cluster to the “build side” worker (the worker dealing with the Hive data here) for the actual Presto join to take place.

This is a simplistic example since in reality Presto is more sophisticated – the join operation could be running in parallel across multiple workers, with a final stage running on one node (since it cannot be parallelized). This final stage is represented by the third worker at the top of the diagram labeled  “Output”. This outputNode’s task is to stream out the result set back to the coordinator, and then back to the client.

Joins – Summary

It’s easy to see how Presto is the “Polyglot Data Access Layer” since it doesn’t matter where your data lives, any query can access any data, in-place, without ETL or data shipping or duplication. It is true federation. Even when blending very different sources of data, like JSON data in elasticsearch or mongodb with tables in a MySQL RDBMS, Presto takes care of the flattening and processing to provide a complete, unified view of your data corpus.

If you want to try out Presto, take a look at Ahana Cloud. It provides a managed service for Presto in AWS.

Executing Presto Spark

Executing Presto Spark queries is possible, but why leverage Spark as an execution framework for Presto’s queries when Presto is itself an efficient execution engine?  The fact that both are in-memory execution engines tends to add further confusion.

Presto and Spark are different kinds of engines – Presto specializes in analytical query execution; Spark’s emphasis is on calculation. The following should help clarify why Presto users would or would not want to use Spark as an additional engine. 

A Quick Presto/Spark Comparison 

Executing Presto queries on Spark can be useful for some very specific workloads, such as queries that we want to run on thousands of nodes, queries requiring 10s or 100s of TBs of memory, queries that may consume many CPU-years, or cases where we need extra execution resilience. So if you have a particularly ugly, complex, and very long-running query, this may be an option.

Spark provides several features that can be desirable for certain workloads, like:

  • Resource isolation
  • Fine-grained resource management
  • Spark’s scalable materialized exchange mechanism 
  • In terms of memory usage, both are in-memory execution engines, but with one important difference: Spark will write/spill data to disk when memory is exhausted.
  • Spark has data processing fault tolerance, so if a worker node fails and its data partition goes away, that partition will be re-computed.

There are also some downsides, or at least considerations, when using Spark with Presto: 

  • Spark commits tasks and applies for resources dynamically as it needs them, so at each stage tasks grab their required resources, a strategy that can slow down processing. Presto, on the other hand, does this up front for the most part.
  • The way in which data is processed by the two engines differs: Spark needs data to be fully processed before progressing to the next processing stage. Presto uses a pipeline processing approach where data is sent to the next task as soon as possible which can reduce total execution time.
  • Spark SQL supports a subset of the ANSI SQL standard. Many features are missing, and for this reason many developers avoid Spark SQL. 
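The staged-versus-pipelined distinction above can be sketched in miniature. This is a toy Python illustration of the two processing styles, not Presto or Spark internals: the "staged" version materializes each stage's full output before the next stage starts, while the "pipelined" version streams rows onward as they are produced.

```python
def staged(data):
    stage1 = [x * 2 for x in data]    # stage 1 fully materialized first
    return [x + 1 for x in stage1]    # stage 2 only starts afterwards

def pipelined(data):
    stage1 = (x * 2 for x in data)    # lazy generator: nothing runs yet
    return (x + 1 for x in stage1)    # each row flows through both stages

print(staged([1, 2, 3]))           # [3, 5, 7]
print(list(pipelined([1, 2, 3])))  # [3, 5, 7]
```

Both produce the same answer; the difference is when intermediate results exist in memory and how soon downstream work can begin.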

When To Use Spark’s Execution Engine With Presto

It is recommended that you:

  • Do not use Spark for interactive queries. This is not what Spark was designed for, so performance will be poor.
  • Do use Spark for long-running, SQL-based batch ETL queries, provided Spark SQL supports the features and functions you need. You can run ETL/batch type queries in Presto, but you may choose to use Spark as the execution engine with Presto if a user or developer wants to test their query in ad hoc mode using Presto first, and later convert it to Spark when they want to run a batch pipeline in production.

Another scenario where Presto and Spark can be used together: if Spark has no connector available for the data source you want to access, use Presto and its connectors to reach the data, then access Presto from Spark via JDBC.


Setting up Presto and Spark and submitting jobs is well documented in PrestoDB’s docs at https://prestodb.io/docs/current/installation/spark.html, and it is straightforward to set up. You will need Java and Scala installed, a working Presto cluster, and of course a Spark cluster. Then there are two packages to download and a file to configure. You then execute your queries from the Spark cluster by running spark-submit.

Tip: If you are using a Mac, standalone Spark can be installed very simply with brew install apache-spark. Similarly Presto can be installed with  brew install prestodb. 

Verify your Spark cluster by running spark-shell on the command line.


Presto’s config.properties file needs to have its task.concurrency, task.max-worker-threads and task.writer-count parameters set to values appropriate for your hardware. Ensure these values also match the parameters used with the spark-submit command.


Use the spark-submit command to invoke Spark, passing the sql query file as an argument. See https://prestodb.io/docs/current/installation/spark.html for an example spark-submit command. 

The above article should help you understand the how, why and when of executing Presto queries on Spark.

When I run a query with AWS Athena, I get the error message ‘query exhausted resources on this scale factor’. Why?

AWS Athena is well documented as having performance issues, in terms of both unpredictability and speed. Many users have pointed out that even relatively lightweight queries on Athena will fail. One part of the issue may be how many columns the user has in the Group By clause – even a small number of columns (fewer than five) can run into this issue of not having enough resources to complete. Other times it may be due to how much data is being parsed, and again even small amounts of data (less than 200MB) can run into the same issue.

Presto stores Group By columns in memory while it works to match rows with the same group by key. The more columns in the Group By clause, the fewer rows the aggregation collapses, and the more per-group state must be held in memory. To address this problem, users have to reduce the number of columns in the Group By clause and retry the query.
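The effect is easy to see in miniature. This is a plain-Python illustration (not Presto code, and the column moduli are made up): the number of distinct group keys — and hence the per-group state an engine must keep in memory — grows quickly as columns are added to the grouping key.

```python
# 1,000 rows with three low-cardinality columns.
rows = [(i % 4, i % 5, i % 7) for i in range(1000)]

# Distinct group keys for 1, 2, and 3 grouping columns.
groups_1col = {(r[0],) for r in rows}
groups_2col = {(r[0], r[1]) for r in rows}
groups_3col = {r for r in rows}

print(len(groups_1col), len(groups_2col), len(groups_3col))  # 4 20 140
```

Each additional grouping column multiplies the number of groups, which is why trimming the Group By list reduces memory pressure.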

At other times, the issue may not be how long the query takes but whether the query runs at all. Users who experience “internal errors” on queries one hour can re-run the same queries later and have them succeed.

Ultimately, AWS Athena is not predictable when it comes to query performance. That’s where Ahana Cloud, a managed service for Presto, can help. Ahana’s managed service for PrestoDB can help with some of the trade-offs associated with a serverless service.

Some of the reasons you might want to try a managed service if you’re running into performance issues with AWS Athena:

  • You get full control of your deployment, including the number of PrestoDB nodes in your deployment and the node instance types for optimum price/performance. 
  • Consistent performance because you have full control of the deployment
  • Ahana is cloud-native and runs on Amazon Elastic Kubernetes (EKS), helping you to reduce operational costs with its automated cluster management, increased resilience, speed, and ease of use. 
  • No limits on queries

Plus you can use your existing metastore, so you don’t need to modify your existing architecture.

Check out the case study from ad tech company Carbon on why they moved from AWS Athena to Ahana Cloud for better query performance and more control over their deployment.

Creating tables in a S3 Bucket gives “Query failed External location must be a directory”

So here’s why you are here. You are using Presto’s Hive connector and you want to create a new table using an S3 folder that already exists. You have full read/write access to the folder, you can see the folder, but Presto tells you “External location must be a directory”.

Whiskey Tango Foxtrot – The directory isn’t a directory? 

In this example, you have already created your target directory myDir using the AWS S3 console, in a bucket called myBucket,  but attempts to create the customer table in that directory fail:

presto> CREATE TABLE hive.customer ( customer_id bigint, email varchar(64) ) 
WITH ( external_location = 's3a://myBucket/myDir/' , format = 'PARQUET' );

Query 20210104_110546_00015_mhsym failed: External location must be a directory

The issue here is not immediately obvious, read on further to learn more.

Why is the issue occurring?

S3 is an immutable object store and not a traditional file system. Objects are stored as key-value pairs internally: the key is the “path” and the value is the actual object. There is no such thing as a folder in the traditional filesystem sense of the word. All there is are objects and associated metadata. If the path doesn’t have the appropriate metadata associated with it (in this case content-type: ‘application/x-directory’), Presto will complain that the path is not a directory.

You have two primary options for addressing the issue:

  1. Manually associate the metadata via the AWS Console UI
  2. Copy the object over again using S3 command line or python script

Note: Since S3 is an immutable store, there is no way to simply “update the metadata”. Under the covers, if you use the AWS Console, the UI performs a clone of the object and updates the metadata. If you need to do it via API you have to perform a copyObject and update the metadata. 

See snippet from the AWS docs:  https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html

“You can set object metadata in Amazon S3 at the time you upload the object. Object metadata is a set of name-value pairs. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata.”

So how do I fix the issue?

Option 1: Change metadata using the AWS Console UI

Choose “Edit metadata” for the path in question

Choose Edit metadata for the Path

Add ‘Content-Type’ as ‘application/x-directory’ as a system defined metadata

Set Content-Type to ‘application/x-directory’

Option 2: Change metadata using tools (copy objects over again)


aws s3 cp --content-type 'application/x-directory' s3://<bucket>/<object> s3://<bucket>/<object> --metadata-directive REPLACE


#Put with metadata
$ aws s3api put-object --bucket mybucket --key folder_name_xxx/ --content-type application/x-directory
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\""
#Verify the metadata
$ aws s3api head-object --bucket mybucket --key folder_name_xxx/
    "AcceptRanges": "bytes",
    "LastModified": "Fri, 23 Apr 2021 16:03:21 GMT",
    "ContentLength": 0,
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
    "ContentType": "application/x-directory",
    "Metadata": {}

Python Script

import boto3

BUCKET = 'bucket-name'  # your bucket
KEY = 'myDir/'          # the "directory" key (note the trailing slash)

# Copy the object onto itself, replacing its metadata/content type
s3 = boto3.resource('s3')
s3.Object(BUCKET, KEY).copy_from(
    CopySource={'Bucket': BUCKET, 'Key': KEY},
    ContentType='application/x-directory', MetadataDirective='REPLACE')

# Or overwrite the empty directory object directly with the right content type
boto3.client('s3').put_object(Bucket=BUCKET, Key=KEY,
                              ContentType='application/x-directory')

The above-mentioned steps should help you resolve the issue and your create table should work fine without complaining about the ‘External location must be a directory’ error.

--Now CREATE TABLE works fine
presto> CREATE TABLE hive.customer ( customer_id bigint, email varchar(64) )
 WITH ( external_location = 's3a://myBucket/myDir/' , format = 'PARQUET' );

Big Data Query

When it comes to querying big data using standard SQL you have come to the right place as this is what Presto was designed to do. Presto, the leading open source distributed query engine,  gives users the ability to interactively query any data and any data source at any scale.  And it is proven – Presto has been in large-scale production deployment at internet-scale companies like Facebook, Uber, and Twitter for several years and the project is enjoying continued innovation as a result. 

Ahana’s mission is to simplify ad hoc analytics for organizations of all shapes and sizes by making open source Presto a fully integrated, cloud-native, managed service for AWS – the easiest Presto experience there is. 

The role of a big data analyst is to discover patterns, trends and correlations in large datasets. Big data analytics is the umbrella term for activities ranging from generating reports and creating dashboards to running complex queries.  A big data query is a request for data from a database table, or from a combination of many different tables or files, which may be in different file formats, in relational databases, and in object storage like S3.

In the past, analysts typically queried data stored in traditional relational databases and data warehouses. These days, the amount of data being generated is too big to be stored solely in on-prem relational databases; as a result, big data is now increasingly stored in the cloud and/or in distributed clusters. To query these huge and dispersed datasets, you need a distributed query engine like Presto.

The good news is, if you work with terabytes or even petabytes of data, Presto is the ideal high-performance, distributed SQL query engine for querying huge and diverse datasets. It’s an open source, distributed SQL query engine that is specifically designed to run big data queries against data of any size, on a wide variety of sources. It makes use of multiple connectors which allow you to access these data sources and query them in place; there is no need to move the data.

One of the main benefits of Presto is that it separates data storage from processing, so analysts can query big data where it is stored without having to move all of the data into a separate analytics system. With just a single Presto query, you can query big data from multiple sources with fast response times ranging from sub-seconds to minutes. Another benefit of Presto is that big data analysts can use standard query language (SQL) to query the data, which means they don’t have to learn any new complex languages.

This is why Presto is becoming the de facto standard for Big Data Querying.

I use RDS Postgres databases and need some complex queries done which tend to slow down my databases for everyone else on the system. What do I need to consider as I add a data lake for the analytics?


Many medium-sized companies start out using one of the six flavors of Amazon RDS: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. As they grow they can have numerous instances of RDS with numerous databases for each customer, whether internal or external. When one customer runs a large analytic query, that workload can cause problems with the RDS cluster, perhaps causing it to drop other workloads, fail, or slow down for others. As the need to process huge amounts of data increases, so does the need to take your analytics to the next level.

In addition to your operational databases, the idea is to have a much more open analytics stack where you have the ability to run different kinds of processing on the same data. A modern analytics stack lets your organization gain far more insight without impacting your operational side. And doing that with open data formats is another key consideration.


There are a couple of options for evolving your analytics stack. One would be to use a cloud data warehouse like Amazon Redshift or Snowflake. Another would be to use open formats in a data lake with a modern SQL query engine. In the first case, you get some of the highest performance possible on your data, but it comes at a cost as well as a degree of lock-in, since you cannot easily get at data stored in proprietary formats. Considering the data lake with query engine option, we believe that Presto is one of the best choices because of its performance, scalability, and flexibility to connect to S3-based data lakes and federate other data sources as well.

So our recommendation would be to run Presto on top of data stored in an open Parquet or ORC format in S3. Doing it this way, you can put other engines on top of the data as needed, so you’re not going to face a lot of rework in the future should you decide to change something.

From OLTP to Data Lake Analytics

The high level concept is to have an initial one-time bulk migration of your data in OLTP databases to get a copy moved into S3. After that, as your operational databases continue to generate or change data, you’ll need to establish a pipeline, stream, or Change Data Capture (CDC) process to get those changes into S3. Note that you will rarely want data flowing back from S3 into your relational databases.

While there are different ways to pipe data into S3, one AWS recommended approach is to use the AWS Database Migration Service, a.k.a. DMS (much in the same way you may have used it when you migrated off-prem). With AWS Database Migration Service, you can get the first one-time bulk load and then continuously replicate your data with high availability and stream data to Amazon S3. AWS DMS would run in the background and handle all of the data changes for you. You can pick the instance and what period of time you’d like it to run; for example, you may want it to write hourly partitions or daily partitions. That will depend on how fast your data is changing and what your requirements are for analytics as well.

Next you’ll want to install Presto on top, and for that you can build a Presto cluster yourself, or simply use Ahana Cloud for Presto to create pre-configured clusters in about 30 minutes.

It’s also worth noting that after you’ve ingested the data into S3 based on what you think the most optimized format or folder structure would be, you may find that you need it organized differently. In that case, not to worry: you can use Presto itself to do data lake transformations as well. Using SQL you can do a CTAS, Create Table As Select:

The CREATE TABLE AS SELECT (CTAS) statement is one of the most important SQL features available. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest and fastest way to create and insert data into a table with a single command.
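To make the CTAS idea concrete, here is a minimal runnable sketch using SQLite (so it works anywhere Python does); the table names are made up, and in Presto you would typically add a WITH clause, e.g. WITH (format = 'PARQUET', external_location = 's3a://bucket/dir/'), to control the output layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [(1, 'US'), (2, 'UK'), (3, 'US')])

# One statement both creates the new table and populates it from a SELECT.
conn.execute("CREATE TABLE us_events AS "
             "SELECT * FROM raw_events WHERE country = 'US'")
print(conn.execute("SELECT COUNT(*) FROM us_events").fetchone()[0])  # 2
```

In a data lake the same pattern lets you rewrite data into a new format or folder layout with a single SQL command.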


By adding a modern analytics stack to your operational databases, you evolve your analytics capabilities and deliver more insights for better decisions. We suggest moving to an open data lake analytics reference architecture with Presto. This allows a meta-analysis of all your data, giving a look at broader trends across databases and other data sources.

Advanced SQL Tutorial


Presto has a wide range of JSON functions. Consider this JSON test input data (represented in the query using the VALUES function), which contains an array of three key/value elements. Each key is “name” and each value is a dog breed. If we want to select the first (0th) key/value pair we would code:

SELECT json_extract(v, '$.dogs') AS all_json, 
  json_extract(v, '$.dogs[0].name') AS name_json, 
  json_extract_scalar(v, '$.dogs[0].name') AS name_scalar 
FROM (VALUES JSON '{"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]}') AS t (v);
                         all_json                         | name_json | name_scalar 
 [{"name":"Beagle"},{"name":"Collie"},{"name":"Terrier"}] | "Beagle"  | Beagle      
(1 row)

All of Presto’s JSON functions can be found at: https://prestodb.io/docs/current/functions/json.html 

Arrays, Unnesting, and Lambda functions 

Consider the following array of test data elements, and a simple query to multiply each element by 2:

SELECT elements,
    ARRAY(SELECT v * 2
          FROM UNNEST(elements) AS v) AS my_result
FROM (VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);
    elements    | my_result
 [1, 2]         | [2, 4]
 [1, 3, 9]      | [2, 6, 18]
 [1, 4, 16, 64] | [2, 8, 32, 128]
(3 rows)

The above query is an example of nested relational algebra, which provides a fairly elegant and unified way to query and manipulate nested data. 

Now here’s the same query, but written using a lambda expression. Why use lambda expressions?  This method makes querying nested data less complex and the code simpler to read/develop/debug, especially when logic gets more complicated:

SELECT elements, 
transform(elements, v -> v * 2) AS my_result
FROM (VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);

Both queries return the same result. The transform function applies a lambda expression to each element, and the “x -> y” notation simply means “do y to my variable x”.
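If the arrow notation is unfamiliar, Presto’s transform behaves much like map with a lambda in Python (a loose analogy, not Presto code):

```python
elements = [1, 4, 16, 64]

# Presto: transform(elements, v -> v * 2)
doubled = list(map(lambda v: v * 2, elements))
print(doubled)  # [2, 8, 32, 128]
```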

To see more lambda expression examples check out: https://prestodb.io/docs/current/functions/lambda.html 

Counting Distinct Values

Running a count(distinct xxx) function is memory intensive and can be slow to execute on larger data sets. This is true for most databases and query engines.  The Presto-cli will even display a warning reminding you of this.  

A useful alternative is to use the approx_distinct function which uses a different algorithm (the HyperLogLog algorithm) to estimate the number of distinct values.  The result is an approximation and the margin of error depends on the cardinality of the data. The approx_distinct function should produce a standard error of up to 2.3% (but it could be higher with unusual data). 
Here’s an example comparing distinct and approx_distinct with a table containing 160.7 million rows.  Data is stored in S3 as Parquet files and the Presto cluster has 4 workers. We can see approx_distinct is more than twice as fast as count(distinct xxx):

presto:amazon> select count(distinct product_id) from review;
(1 row)
WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
Query 20201231_154449_00058_npjtk, FINISHED, 4 nodes
Splits: 775 total, 775 done (100.00%)
0:56 [161M rows, 1.02GB] [2.85M rows/s, 18.4MB/s]
presto:amazon> select approx_distinct(product_id) from review;
(1 row)
Query 20201231_154622_00059_npjtk, FINISHED, 4 nodes
Splits: 647 total, 647 done (100.00%)
0:23 [161M rows, 1.02GB] [7.01M rows/s, 45.4MB/s]

Complex SQL Queries

Complex SQL queries benefit from Presto’s distributed, parallel, in-memory processing architecture and cost-based optimizer. And with Presto’s federation capabilities even more complex queries can be unleashed on multiple data sources in a single query. 

To ensure optimum performance with complex queries Presto has features like dynamic filtering which can significantly improve the performance of queries with selective joins by avoiding reading data that would be filtered by join condition.  And the collection of table statistics using ANALYZE tablename is highly recommended. 

Complex SQL features and functions are used when advanced calculations are needed, or when many tables (perhaps from multiple sources) are to be joined, when dealing with nested or repeated data, dealing with time-series data or complex data types like maps, arrays, structs and JSON, or perhaps a combination of all these things.  

Presto’s ANSI SQL engine supports numerous advanced functions which can be split into several categories; the PrestoDB documentation provides a full reference for each.

Advanced SQL Queries with Presto

Advanced SQL features and functions are used by analysts when, for example, complex calculations are needed, or when many tables (perhaps from multiple sources) need to be joined, when dealing with nested or repeated data, dealing with time-series data or complex data types like maps, arrays, structs and JSON, or perhaps a combination of all these things.  

Presto’s ANSI SQL engine supports numerous advanced functions which can be split into several categories; the PrestoDB documentation provides a full reference for each.

Running advanced SQL queries can benefit greatly from Presto’s distributed, in-memory processing architecture and cost-based optimizer. 

Presto Platform Overview

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Some of the world’s most innovative and data-driven companies like Facebook, Walmart and Netflix depend on Presto for querying data sets ranging from gigabytes to petabytes in size. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte of data per day.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto allows querying data where it lives, including Hive, Cassandra, relational databases, HDFS, object stores, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. Presto is an in-memory distributed, parallel system. 

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow “free” solution that requires excessive hardware. A single Presto query can combine data from multiple sources. 

The Presto platform is composed of:

  • Two types of Presto servers: coordinators and workers. 
  • One or more connectors: Connectors link Presto to a data source such as Hive or a relational database. You can think of a connector the same way you think of a driver for a database. 
  • Cost Based Query Optimizer and Execution Engine. Parser. Planner. Scheduler.
  • Drivers for connecting tools, including JDBC. The Presto-cli tool. The Presto Console. 

In terms of organization the community owned and driven PrestoDB project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. Presto software is released under the Apache License 2.0.

Ahana offers a managed service for Presto in the cloud. You can get started for free today.

How To Stop Presto

If you are using the presto-admin tool this is how to stop Presto safely:

$ presto-admin server stop

In addition these commands are also useful:

presto-admin server start
presto-admin server restart
presto-admin server status

If you are using Presto on EMR, you can restart Presto with:

sudo stop presto
sudo start presto

(If you have changed any configuration params you should do this on every node where you made a change).

The above “How To Stop Presto” information is correct for PrestoDB. But other “hard forks” of Presto may use different methods to stop and start. 

Presto New Releases

Where is the latest release of PrestoDB? And where can I find the release notes? Where is the documentation? These are common questions with easy answers. Presto’s web site https://prestodb.io/ and GitHub have all the information you need on Presto new releases:

Releases: https://github.com/prestodb for the Presto and Presto-admin repositories.

Release Notes: https://prestodb.io/docs/current/release.html 

Documentation: https://prestodb.io/docs/current/ 

The above list has all the main resources you need for working with Presto’s new releases. If you’re looking to get up and running quickly with Presto, Ahana Cloud is a SaaS platform for Presto. Presto can be complicated, with over 200 parameters to configure and tune – Ahana Cloud takes care of all those; all you do is point and click to get your Presto cluster up and querying your data sources. You can sign up for a free trial and check it out at https://ahana.io/sign-up

How Much Memory To Give A Presto Worker Node

Presto is an in-memory query engine and so naturally memory configuration and management is important. A common question that comes up is how much memory should I give a worker node?

JVM Memory

Presto’s JVM memory config nearly always needs to be changed – you shouldn’t be running Presto with its default setting of 16GB of memory per worker/coordinator, which caps the JVM heap regardless of how much memory your machines actually have.

The Xmx flag specifies the maximum memory allocation pool for a Java virtual machine. Change the  -Xmx16G in the jvm.config file to a number based on your cluster’s capacity, and number of nodes.  See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html on how to do this.  

Rule of Thumb

It is recommended you set aside 15-20% of total physical memory for the OS.  So for example if you are using EC2 “r5.xlarge” instances which have 32GB of memory,  32GB-20% = 25.6GB so you would use  -Xmx25G in the jvm.config file for coordinator and worker (or  -Xmx27G if you want to go with 15% for the OS).

This is assuming there are no other services running on the server/instance, so maximum memory can be given to Presto. 
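The rule-of-thumb arithmetic above can be captured in a few lines; this is just a sketch of the guideline (reserve 15-20% of physical memory for the OS, give the rest to the JVM heap), not an official sizing tool:

```python
# Suggested -Xmx value in GB, given total physical memory and the
# fraction reserved for the OS (15-20% per the rule of thumb).
def suggested_xmx_gb(total_gb, os_fraction=0.20):
    return int(total_gb * (1 - os_fraction))

# r5.xlarge has 32GB: 32 - 20% = 25.6, so use -Xmx25G in jvm.config
print(suggested_xmx_gb(32))        # 25
print(suggested_xmx_gb(32, 0.15))  # 27  (if reserving 15% instead)
```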

Presto Memory 

Like the JVM settings above, there are two memory-related settings that you should check before starting Presto. For most workloads, Presto’s other memory settings will work perfectly well at their defaults. There are, however, configurable parameters controlling memory allocation that can be useful for specific workloads. The practical guidelines below will 1) help you decide if you need to change your Presto memory configuration, and 2) identify which parameters to change.

Workload Considerations

You may want to change Presto’s memory configuration to optimise ETL workloads versus analytical workloads, or for high query concurrency versus single-query scenarios.  There’s a great in-depth blog on Presto’s memory management written by one of Presto’s contributors at  https://prestodb.io/blog/2019/08/19/memory-tracking which will guide you in making more detailed tweaks. 

Configuration Files & Parameters 

When first deploying Presto there are two memory settings that need checking. Locate the config.properties files for both the coordinator and worker. There are two important parameters here: query.max-memory-per-node and query.max-memory.  Again see https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for rules-of-thumb and how to configure these parameters based on the available memory. 

The above guidelines and links should help you decide how much memory to give a worker node. You can see there’s a lot of tuning and many configs to manage – over 200. To avoid all memory configuration work you can use Ahana Cloud for Presto – a fully managed service for Presto that needs zero configuration. Sign up for a free trial at https://ahana.io/sign-up.

How to Show Tables From All Schemas with Presto

In Presto it is straightforward to show all tables in a schema. For example, if we have a MySQL data source/catalog that has a “demo” schema, we use show tables in mysql.demo; but this only reveals the tables in that schema.

There is no equivalent way to show all tables in all schemas for a data source. However, there is the metastore to fall back on, which we can query: in MySQL, Glue, Hive and others there is a schema called “information_schema” which contains a table called “tables”. This maintains a list of schemas and tables relating to that data source.

For a MySQL data source, here’s what Presto shows in a Presto table:

presto> select table_schema, table_name from mysql.information_schema.tables order by 1,2;
    table_schema    |                  table_name                   
 demo               | geography                                     
 demo               | state                                         
 demo               | test                                          
 information_schema | applicable_roles                              
 information_schema | columns                                       
 information_schema | enabled_roles                                 
 information_schema | roles                                         
 information_schema | schemata                                      
 information_schema | table_privileges                              
 information_schema | tables                                        
 information_schema | views                                         
 sys                | host_summary                                  
 sys                | host_summary_by_file_io                       
 sys                | host_summary_by_file_io_type                  

For Ahana’s integrated Hive metastore:

presto:demo> select table_schema, table_name from ahana_hive.information_schema.tables order by 1,2;
    table_schema    |        table_name         
 csv_test           | yellow_taxi_trips         
 csv_test           | yellow_taxi_trips_orc     
 csv_test           | yellow_taxi_trips_staging 
 information_schema | applicable_roles          
 information_schema | columns

This should help you show tables from all schemas.

How To Convert Bigint to Timestamp with Presto

UNIX timestamps are normally stored as doubles. If you have UNIX timestamps stored as big integers then you may encounter errors when trying to cast them as timestamps:

presto> select col1 from table_a;
presto> select cast(col1 as timestamp) from table_a;
Query 20201127_150824_00052_xnxra failed: line 1:8: Cannot cast bigint to timestamp

There is a solution!  Presto’s from_unixtime() function takes a UNIX timestamp and returns a timestamp:

presto> select col1,from_unixtime(col1) as ts from table_a;
    col1    |          ts          
 1606485526 | 2020-11-27 13:58:46.000 
 1606485575 | 2020-11-27 13:59:35.000 

And we can optionally modify the format of the result by using date_format():

presto> select date_format(from_unixtime(col1),'%Y-%m-%d %h:%i%p') from table_a;
 2020-11-27 01:58PM 
 2020-11-27 01:59PM 

That’s how to use from_unixtime() to convert bigint to timestamp. 

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

How do I convert timestamp to date with Presto?

Luckily Presto has a wide range of conversion functions and they are listed in the docs.  Many of these allow us to specifically convert a timestamp type to a date type.

To test this out we can use Presto’s built-in current_timestamp function (an alias for the now() function) that returns the current system time as a timestamp:

presto> select current_timestamp as "Date & Time Here Now";
         Date & Time Here Now          
 2020-11-27 13:20:04.093 Europe/London 
(1 row)

To grab the date part of a timestamp we can simply cast to a date:

presto> select cast(current_timestamp as date) as "Today's date";
 Today's date 
 2020-11-27   
(1 row)

Or we can use date() which is an alias for cast(x as date):

presto> select date(current_timestamp) as "Today's date";
 Today's date 
 2020-11-27   
(1 row)

We can use date_format() which is one of Presto’s MySQL-compatible functions: 

presto:demo> select date_format(current_timestamp, '%Y%m%d') as "Today's date";
 Today's date  
(1 row)

Finally we can use format_datetime() which uses a format string compatible with JodaTime’s DateTimeFormat pattern format:

presto:demo> select format_datetime(current_timestamp, 'Y-M-d') as "Date";
 Date 
(1 row)

The examples above should allow you to convert timestamps to dates in most scenarios.

>> Looking for an easy way to get up and running with Presto? Take a look at Ahana Cloud. <<

Common Questions/Errors:

Why am I getting the following error: ‘>’ cannot be applied to date, varchar (10)

Answer: You are trying to compare a string literal in your query to a date type in your schema. The operator in the error can be any operator >,=,<,>=,<=. To fix that, you need to cast the string to a date type or use a function to parse the string to date.

Example error: select now() as time_now where current_date = '2021-06-01'

Example solution:

Option 1: Cast 
select now() as time_now where current_date > date '2021-06-01'

Option 2: Cast Explicit
select now() as time_now where current_date > Cast ( '2021-06-01' as Date)

Option 3: Parse iso8601 date
select now() as time_now where current_date > from_iso8601_date('2021-06-01')

Option 4: MySQL parser, Non Standard Date format using date_parse 
select now() as time_now where current_date > date_parse('01-06-2021','%d-%m-%Y')

Option 5: Java DateTime format parser, Non Standard Date format using parse_datetime 
select now() as time_now where current_date > parse_datetime('01-06-2021','dd-MM-yyyy')
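As a cross-check outside Presto, Python's strptime applies the same day-month-year pattern logic as date_parse in Option 4 (an illustrative sketch, not Presto code):

```python
from datetime import datetime, date

# '%d-%m-%Y' here plays the same role as date_parse's '%d-%m-%Y':
# both turn a non-ISO string into a comparable date value
parsed = datetime.strptime('01-06-2021', '%d-%m-%Y').date()
print(parsed)  # 2021-06-01
```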

How do I convert string with timestamp to a date?

Answer: You are trying to accurately convert a string of the format such as ‘2021-05-13T19:02:57.415006+01:00’ to date and use it in your query for comparison. You need to parse the string using either date_parse or parse_datetime

Example problem: Need to convert the following timestamp to date: ‘2021-05-13T19:02:57.415006+01:00’

Example solution:

Option 1: MySQL parser, Non Standard Date format using date_parse 
select date_parse('2021-05-13T19:02:57.415006','%Y-%m-%dT%H:%i:%s.%f') AT TIME ZONE '+01:00'

Option 2: Java DateTime format parser, Non Standard Date format using parse_datetime 
select parse_datetime('2021-05-13T19:02:57.415006+01:00','yyyy-MM-dd''T''HH:mm:ss.SSSSSSZ') 

Looking for related content? How To Convert Date String to Presto Date Format


Case Sensitive Search Configuration with Presto

When dealing with character data, case sensitivity can be important when searching for specific matches or patterns. But not all databases and query engines behave in the same way. Some are case insensitive by default, some are not. How do we configure things so they behave in the way we want?

Here’s an example of why we might need to take steps for case sensitive search configuration. We’re accessing a MySQL database directly:

mysql> select * from state where name='iowa';
| name | id | abbreviation |
| Iowa | 19 | IA           |
1 row in set (0.00 sec)

MySQL is case-insensitive by default. Even though the MySQL column contains the capitalized string ‘Iowa’ it still matched the query’s restriction of ‘iowa’.  This may be acceptable, but in some use cases it could lead to unexpected results.

Using Presto to access the same MySQL data source things behave differently, and arguably, in a more expected way:

presto:demo> select * from state where name='iowa';
 name | id | abbreviation 
(0 rows)
Query 20201120_151345_00001_wjx6r, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:07 [1 rows, 0B] [0 rows/s, 0B/s]
presto:demo> select * from state where name='Iowa';
 name | id | abbreviation 
 Iowa | 19 | IA           
(1 row)

Now we only get a match with ‘Iowa’, and not with ‘iowa’. Presto has essentially made this data source (MySQL) case sensitive, even though it is exactly the same database in both the above examples, with default configurations used.

Reconfigure Case Sensitivity 

With an RDBMS like MySQL you can configure the collation setting to control whether you want case sensitivity or not. You can set the collation at the database or table creation level as part of the CREATE statement, or you can use ALTER to change the collation of a database, table or individual column. This is described in MySQL’s documentation. 

But how do you change Presto to be case-insensitive? Presto does not support collation, because it is not a database and doesn’t store data. And there is no configuration parameter that controls this.  To manage case sensitivity in Presto, and mimic collation, we rewrite the query to force case insensitivity explicitly by using:

  • simple lower() or upper() functions


select * from state where lower(name)='california';
(3 rows)

This query has matched any upper/lower case combination in the table, mimicking case insensitivity.

  • regular expressions


select * from state where regexp_like(name, '(?i)california');
(3 rows)

The regular expression syntax (?i) means matches are case insensitive. 
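The (?i) inline flag is common regex syntax, not something Presto-specific, so you can try the same behaviour locally with Python's re module (an illustrative sketch):

```python
import re

# (?i) switches the rest of the pattern to case-insensitive matching,
# which is what regexp_like(name, '(?i)california') relies on above
loose = bool(re.search(r'(?i)california', 'CALIFORNIA'))
strict = bool(re.search(r'california', 'CALIFORNIA'))
print(loose, strict)  # True False
```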

When it comes to Case Sensitive Search Configuration you are now an eXpErT.

When should I use ORC versus Parquet when using Presto?

If you’re working with open data lakes using open source and open formats, you can have multiple formats. Presto works with both ORC and Parquet. You’ll probably want to optimize for your workloads. 

Both ORC and Parquet store data in columns. Parquet is generally the most efficient when it comes to storage and performance. ORC, on the other hand, is ideal for storing compact data and skipping over irrelevant data without complex or manually maintained indices. For example, ORC is typically better suited for dimension tables, which are slightly smaller, while Parquet works better for fact tables, which are much bigger.

If you’re looking to get up and running quickly with Presto, you can check out Ahana Cloud. It’s a SaaS for Presto and takes care of all the configuration, tuning, deployment, etc.

What’s the advantage of having your own Hive metastore with Presto? How does it compare to Amazon Glue?

First let’s define what Apache Hive is versus Amazon Glue. Apache Hive reads, writes, and manages large datasets using SQL. Hive was built for Hadoop. AWS Glue is a fully managed ETL service for preparing and loading data for analytics. It automates ETL and handles the schemas and transformations. AWS Glue is serverless, so there’s no infrastructure needed to provision or manage it (you only pay for the resources used while your jobs are running).

Presto isn’t a database and does not come with a catalog, so you’d want to use Hive to read/write/manage your datasets. Presto abstracts a catalog like Hive underneath it. You can use the Glue catalog as the default Hive metastore for Presto.

With Ahana Cloud, you don’t really need to worry about integrating Hive and/or AWS Glue with Presto. Presto clusters created with Ahana come with a managed Hive metastore and a pre-integrated Amazon S3 data lake bucket. Ahana takes care of connecting external catalogs like Hive and Amazon Glue, so you can focus more on analytics and less on integrating your catalogs manually. You can also create managed tables as opposed to external tables.

How to Find Out Data Type of Value with Presto

Presto has a typeof() function to make finding out data types of values easy. This is particularly useful when you are getting values from nested maps for example and the data types need to be determined.

Here’s a simple example showing the type returned by Presto’s now() function:

presto:default> select now() as System_Timestamp, typeof( now() ) as "the datatype is";
           System_Timestamp            |     the datatype is      
 2020-11-18 15:15:09.872 Europe/London | timestamp with time zone 

Some more examples:

presto:default> select typeof( 'abc' ) as "the datatype is";
 the datatype is 
presto:default> select typeof( 42 ) as "the datatype is";
 the datatype is 
presto:default> select typeof( 9999999999 ) as "the datatype is";
 the datatype is 
presto:default> select typeof( 3.14159 ) as "the datatype is";
 the datatype is 

Armed with this info you should now be able to find out the data types of values.

How to Rotate Rows to Columns with Presto

Sometimes called pivoting, here is one example of how to rotate rows into columns with Presto.  

Suppose we have rows of data like this:

'a', 9
'b', 8
'a', 7 

We want to pivot this data so that all the ‘a’ row values are arranged in one column, and all the ‘b’ row values are in a second column like this:

  a   |  b   
    9 | NULL 
 NULL |    8 
    7 | NULL 

To rotate from rows to columns we will add an id to make aggregation easy. We will name the output columns a and b after the key values, and we’ll include the id in our result set. This is how we do the rotation in Presto, using VALUES() to supply the test data, and simple conditional CASE WHEN END logic:

presto:default> SELECT id
, MAX(CASE WHEN key = 'a' THEN value END) AS a
, MAX(CASE WHEN key = 'b' THEN value END) AS b 
FROM (VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7 )) as test_data (id, key, value) 
GROUP BY id;
 id |  a   |  b   
  1 |    9 | NULL 
  2 | NULL |    8 
  3 |    7 | NULL 
(3 rows)

There are other SQL options for transforming (pivoting) rows into columns – you can use the map_agg function for example.
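The CASE WHEN pivot itself is plain SQL rather than Presto-specific syntax, so you can experiment with the same logic locally using Python's built-in sqlite3 (an illustrative sketch, not Presto):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same test data and CASE WHEN pivot as the Presto query above
rows = con.execute("""
    WITH test_data(id, key, value) AS (
        VALUES (1, 'a', 9), (2, 'b', 8), (3, 'a', 7)
    )
    SELECT id,
           MAX(CASE WHEN key = 'a' THEN value END) AS a,
           MAX(CASE WHEN key = 'b' THEN value END) AS b
    FROM test_data
    GROUP BY id
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 9, None), (2, None, 8), (3, 7, None)]
```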

The code sample and description here should help when you need to rotate data from rows to columns using Presto.


How do you rotate columns to rows with Presto?

Sometimes called unpivoting, here is one example of how to rotate column data with Presto.  

Suppose we have some integer data arranged in two columns called a and b:

  a   |  b   
    9 | NULL 
 NULL |    8 
    7 | NULL 

We want to rotate the columns into rows like this, where for every ‘a’ column value we now see a row labeled ‘a’, and ditto for the b values:

'a', 9
'b', 8
'a', 7

To rotate from columns to rows in Presto we’ll use a CTE and VALUES() to supply the test data, and simple conditional CASE WHEN END logic coupled with a sub-select and a UNION:

presto:default> with testdata(value_a, value_b) as ( VALUES (9,null), (null,8), (7,null) ) 
select key, value from 
(select 'a' as key, value_a as value 
from testdata 
UNION select 'b' as key, value_b as value 
from testdata) 
where value is not null;
 key | value 
 a   |     9 
 b   |     8 
 a   |     7 
(3 rows)

There are other SQL options for rotating (unpivoting) columns into rows: for example, CROSS JOIN UNNEST, which is similar to Hive’s LATERAL VIEW explode function.
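Again, the UNION-based unpivot is plain SQL, so you can try the same logic locally with Python's built-in sqlite3 (an illustrative sketch, not Presto):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same test data and UNION unpivot as the Presto query above
rows = con.execute("""
    WITH testdata(value_a, value_b) AS (
        VALUES (9, NULL), (NULL, 8), (7, NULL)
    )
    SELECT key, value FROM (
        SELECT 'a' AS key, value_a AS value FROM testdata
        UNION SELECT 'b' AS key, value_b AS value FROM testdata
    )
    WHERE value IS NOT NULL
""").fetchall()
print(sorted(rows))  # [('a', 7), ('a', 9), ('b', 8)]
```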

The code sample and description here should help when you need to rotate data from columns to rows using Presto.

What are the operational benefits of using a managed service for Presto with Ahana Cloud?

First let’s hear from an AWS Solution Architect: “Ahana Cloud uses the best practices of both a SaaS provider and somebody who would build it themselves on-premises. So the advantage with the Ahana Cloud is that Ahana is really doing all the heavy lifting, and really making it a fully managed service, the customer of Ahana does not have to do a lot of work, everything is spun up through cloud formation scripts that uses Amazon EKS, which is our Kubernetes Container Service. The customer really doesn’t have to worry about that. It’s all under the covers that runs in the background. There’s no active management required of Kubernetes or EKS. And then everything is deployed within your VPC. So the VPC is the logical and the security boundary within your account. And you can control all the egress and ingress into that VPC. So you, as the customer, have full control and the biggest advantage is that you’re not moving your data. So unlike some SaaS partners, where you’re required to push that data or cache that data on their side in their account, with the Ahana Cloud, your data never leaves your account, so your data remains local to your location. Now, obviously, with federated queries, you can also query data that’s outside of AWS. But for data that resides on AWS, you don’t have to push that to your SaaS provider.”

Now that you have that context, let’s get more specific. Say you want to create a cluster: it’s just a couple of clicks with Ahana Cloud. You can pick the coordinator instance type and the Hive metastore instance type, and it is all flexible. Instead of using the Ahana-provided Hive metastore, you can bring your own Amazon Glue catalog. Then of course it’s easy to add data sources. For that, you can add in JDBC endpoints for your databases. Ahana has those integrated, and then Ahana Cloud automatically restarts the cluster.

Compared to EMR or if you’re running with other distributions, all of this has to be done manually:

  • you have to create a catalog properties file for each data source
  • restart the cluster on your own
  • scale the cluster manually
  • add your own query logs and statistics
  • rebuild everything when you stop and restart clusters

With Ahana, all of this manual complexity is taken away. For scaling up, if you want to grow the analytics jobs over time, you can add nodes seamlessly. Ahana Cloud and other distributions can add the nodes to the cluster while your services are still up and running. But the part that isn’t seamless is when you stop the entire cluster. In addition to all the workers and the coordinator being provisioned, the configuration and the cluster connections to the data sources, and the Hive metastore are all maintained with Ahana Cloud. And so when you restart the cluster back up, all of that comes up pre-integrated with the click of a button: the nodes get provisioned again, and you have access to that same cluster to continue your analytics service. This is very important, because otherwise, you would have to manage it on your own, including the configuration management and reconfiguration of the catalog services. Specifically for EMR, for example, when you terminate a cluster, you lose track of that cluster altogether. You have to start from scratch and reintegrate the whole system.

How to Lateral View Explode in Presto

Hive’s explode() function takes an array (or a map) as input and outputs the elements of the array (map) as separate rows. Explode is a built-in Table-Generating Function (UDTF) in Hive and can be used in a SELECT expression list and as part of a LATERAL VIEW.

The explode function doesn’t exist in Presto; instead we can use Presto’s similar UNNEST. 

Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as (
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.
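If it helps to see the same transformation without SQL, here is the equivalent logic in plain Python (an illustrative sketch of what the query computes, not Presto code):

```python
import json
from collections import defaultdict

docs = [
    '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}',
    '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}',
]
scores = defaultdict(list)
for doc in docs:
    for row in json.loads(doc)["result"]:          # same role as JSON_EXTRACT '$.result'
        scores[row["name"]].append(int(row["score"]))  # same role as the CAST to INTEGER

averages = {name: sum(v) / len(v) for name, v in scores.items()}
print(averages)  # {'Jarret': 89.0, 'Blanche': 85.5}
```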

This code sample and description should help when you need to to do a lateral view explode in Presto.


How do you cross join unnest a JSON array?

Let’s take a look at how you can cross join unnest a JSON array by doing Presto joins. Here’s an example using test results data in json form as input, from which we compute the average score per student.  We use the WITH clause to define a common table expression (CTE) named example with a column alias name of data. The VALUES function returns a table rowset. 

WITH example(data) as (
    VALUES
    (json '{"result":[{"name":"Jarret","score":"90"},{"name":"Blanche","score":"95"}]}'),
    (json '{"result":[{"name":"Blanche","score":"76"},{"name":"Jarret","score":"88"}]}')
)
SELECT n.name as "Student Name", avg(n.score) as "Average Score"
FROM example
CROSS JOIN UNNEST ( 
        CAST (JSON_EXTRACT(data, '$.result')
        as ARRAY(ROW(name VARCHAR, score INTEGER )))
    ) as n
--WHERE n.name='Jarret'
GROUP BY n.name;

Student Name | Average Score 
 Jarret      |          89.0 
 Blanche     |          85.5 
(2 rows)

The UNNEST function takes an array within a column of a single row and returns the elements of the array as multiple rows.

CAST converts the JSON type to an ARRAY type which UNNEST requires.

JSON_EXTRACT uses a jsonPath expression to return the array value of the result key in the data.

The UNNEST approach is similar to Hive’s explode function.

This code sample and description should help when you need to execute a cross join to unnest a JSON array. 


How can you write the output of queries to S3 easily?

With Ahana Cloud, we’ve made it easy for you to write the output of queries to S3. While there’s a variety of formats, here’s an example:

presto> CREATE SCHEMA ahana_hive.s3_write WITH (location = 's3a://parquet-test-bucket/');

presto> CREATE TABLE ahana_hive.s3_write.my_table
WITH (format = 'PARQUET')
AS SELECT <your query here> ;

Does Amazon Athena do joins across other data sources besides S3? Does Amazon Athena connect to other data sources?

With Amazon Athena you’re limited in scope when it comes to doing joins across other data sources like relational database systems. You have to set up a Lambda function, which then connects to your database on the back end, and that is an additional piece you have to manage on your own.

With Ahana Cloud, you can directly connect to many different types of data sources including S3, MySQL, PostgreSQL, Redshift, Elastic, and more with a few clicks, and you’re ready to integrate with any cluster.

If I have catalogs connected and configurations attached to my Presto cluster, what happens when I take the cluster down?

If you’re managing Presto on your own, either through your own installation or through a service like AWS EMR or AWS Athena, you have to maintain and manage all of the catalogs and configurations attached to your cluster. That means that if you take your cluster down, you’ll lose those catalogs and configurations – they are not maintained for your cluster.

You can use the Ahana Cloud managed service for Presto to help with this. Ahana Cloud manages all of that for you, so you don’t have to worry about losing catalogs and configurations attached to your Presto cluster. You also get Presto bundled with data sources like the Hive metastore, Apache Superset, and more.

Check If Map Or Presto Array Is Empty Or Contains

When working with array (indexable list) or map (key-value tuple) complex types in Presto, it is useful to be able to check whether a map or array is empty. The examples below show how.


First, let’s create a Presto map from an array of keys and an array of values:

presto> select map(array['taco','donut'], array[1,2]) as myMap;
 {donut=2, taco=1} 

We can use the CARDINALITY() function to check if the map contains anything:

presto> select cardinality(map(array['taco','donut'], array[1,2])) = 0 as is_empty;
 is_empty 
 false    

presto> select cardinality(map(array[], array[])) = 0 as is_empty;
 is_empty 
 true     

We can also compare a map with an empty map() to test if it is empty:

presto> select (map(array['taco'], array['donut']) = map()) as is_empty;
 is_empty 
 false    


Again, we can use the CARDINALITY() function to check if an array contains anything. Here’s an example using an array with 3 elements:

presto> SELECT cardinality(ARRAY['Ahana', 'Cloud', 'Presto']) = 0 as is_empty;
 is_empty 
 false    

presto> SELECT cardinality(ARRAY[]) = 0 as is_empty;
 is_empty 
 true     

Another method to check if an array is empty is to compare it with array[] like this:

presto> SELECT (map_keys(map(ARRAY['Ahana'],ARRAY['Presto']))= array[]) as is_empty;
 is_empty 
 false    

Tip: Be aware of null values. If an array contains nulls it is not considered to be empty:

presto> SELECT cardinality(ARRAY['', '', '']) = 0 as is_empty;
 is_empty 
 false    

presto> SELECT cardinality(ARRAY[null,null,null]) = 0 as is_empty;
 is_empty 
 false    
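When prototyping these checks outside Presto, Python's len() plays the same role as cardinality() (illustrative only, not Presto code):

```python
# len() counts elements just like cardinality(): empty strings and
# None values still count toward the length
map_is_empty = len({'taco': 1, 'donut': 2}) == 0
array_is_empty = len([]) == 0
nulls_is_empty = len([None, None, None]) == 0
print(map_is_empty, array_is_empty, nulls_is_empty)  # False True False
```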

The above query examples should enable you to check if a map or array is empty.

If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

Export Result Of Select Statement in Presto

A common question is “how can I run a query and export result of select statement quickly and easily in Presto?”  You are in luck, as several solutions exist – your choice will depend on your preferred tool, the output format you want, and the size of the result-set. Here are two options.

Using Presto-CLI

If you are using the Presto command line tool presto-cli (or just presto on the Mac if you used brew install presto) then use the --output-format parameter and redirect the output to a regular file. For example:

$ presto --server https://myPrestoCluster.com \
--catalog glue --schema amazon \
--execute "select product_title, count(*) as Num_Reviews from reviews where lower(product_title) like '%tacos%' group by 1 order by 2 desc limit 20" \
--output-format CSV > top20_taco_reviews.csv

$ head top20_taco_reviews.csv
"Dragons Love Tacos","832"
"Vitacost Extra Virgin Certified Organic Coconut Oil 16 Oz","281"
"Vitacost 100% Pure Peppermint Oil -- 4 fl oz","178"
"Vitacost 100% Pure Lavender Oil -- 4 fl oz","168"
"Taco Tender Holder - Plastic Red Stand - Holds 3 Tacos","106"
"Vitacost Infant Health - Baby-D's Liquid Vitamin D Drops -- 400 IU - 1 fl oz","101"
"Vitacost Butterbur Extract - Standardized -- 75 mg - 120 Capsules","84"
"Tacos, Tortas, and Tamales: Flavors from the Griddles, Pots, and Streetside Kitchens of Mexico","63"
"Vitacost Pine Bark Extract - Standardized to 95% OPC -- 100 mg - 300 Capsules","51"
"Vegan Tacos: Authentic and Inspired Recipes for Mexico's Favorite Street Food","45"

There are several formats supported by Presto-CLI, the default being quoted CSV:

--output-format <output-format>
            Output format for batch mode [ALIGNED, VERTICAL, CSV, TSV,
            CSV_HEADER, TSV_HEADER, NULL] (default: CSV)

So if you want to see column headers in your CSV format output file use --output-format CSV_HEADER

The advantage of using this approach is speed.
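Because the output is quoted CSV, downstream tools can parse it directly; for example with Python's csv module (an illustrative sketch using two lines from the output above):

```python
import csv
import io

# Presto-CLI's default CSV output is quoted; csv.reader handles the quoting
sample = '"Dragons Love Tacos","832"\n"Vitacost 100% Pure Lavender Oil -- 4 fl oz","168"\n'
rows = list(csv.reader(io.StringIO(sample)))
print(rows[0])  # ['Dragons Love Tacos', '832']
```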

Using a Database Tool

If you are using a third-party SQL database tool like DbVisualizer, DBeaver or SQuirreL SQL then the UI will allow you to save the query output. For example, using DbVisualizer, run your query and click the Export button.

The advantage of this method is the huge number of output formatting options on offer. The disadvantage is it is usually slower than using Presto-CLI. 

The two options above should help you export results of a Select statement.

How to Manage Presto Queries Running Slow

There are a few reasons that Presto performance is impacted and queries might be running slow. Below we’ll share some things to do in terms of diagnosis and Presto tuning, as well as possible solutions to address the issue. These can help with performance and Presto usage.

  1. How many workers do you have in your cluster? If your PrestoDB cluster has many (>50) workers then, depending on workload and query profile, your single coordinator node could be overloaded. The coordinator node has many duties, like parsing, analyzing, planning and optimizing queries, consolidating results from the workers, task tracking and resource management. Add to that the burden of all the internal communication with the other nodes in the cluster being fairly heavyweight JSON over HTTP, and you can appreciate how things could begin to slow down at scale. (Note that Presto projects like the “disaggregated coordinator” Fireball project aim to eliminate Presto’s single-coordinator bottleneck.) In the meantime, try increasing the resources available to the coordinator by running it on a larger cloud instance, as more CPU and memory could help. You may also run into issues when there are multiple concurrent Presto users.
  2. Have you configured Presto and memory usage correctly?  It is often necessary to change the default memory configuration based on your cluster’s capacity.  The default max memory for a Presto server is 16 GB, but if you have a lot more memory available, you may want to allocate more memory to Presto for better performance. See https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html for configuration details. One rule of thumb:  In each node’s jvm.config, set -Xmx to 80% of the available memory initially, then adjust later based on your monitoring of the workloads.
  3. What kind of instances are your worker nodes running on – do they have enough I/O? Picking the right kind of instance for worker nodes is important. Most analytical workloads are IO intensive, so the amount of network IO available can be a limiting factor, and overall throughput will dictate query performance. Consider choosing higher network IO instances for the workers – for example on AWS you can compare each instance type’s “network performance” rating.
  4. Optimize your metadata / data catalog: Using Presto’s Hive connector for your metastore, like many users do, will mean practically every query accesses the Hive metastore for table and partition details etc. During peak times that generates a high load on the metastore which can slow down query performance. To alleviate this consider:
    • Setting up multiple catalogs. Configure PrestoDB to use multiple thrift metastore endpoints – Presto’s Hive connector supports configuring multiple hive metastore endpoints which are tried in round-robin by the coordinator. https://prestodb.io/docs/current/connector/hive.html 
    • Enabling Hive metastore caching, and carefully tweaking cache eviction configurations and TTLs to suit your data refresh policies
  5. Do you have a separate coordinator node? With PrestoDB you can have a single node act as both coordinator and worker, which can be useful for tiny clusters like sandboxes for testing purposes, but it’s obviously not optimal in terms of performance. It is nearly always recommended to have the coordinator running on a separate node from the workers for anything other than sandbox use. Tip: Check your nodes’ Presto etc/config.properties files to determine which one is the coordinator (look for coordinator=true).
  6. Is memory exhausted? Presto uses an in-memory, pipelined processing architecture, and its operation depends on the available JVM memory, which in turn depends on how much memory Presto is configured to use and how much is physically available on the server or instance it is running on.
    • The workers can be memory hungry when processing very large queries. Monitor their memory usage and look for failed queries. Allocate more memory if necessary and switch to using a more memory-rich machine if practical. 
    • The coordinator should be allocated a significant amount of memory – often more than a worker – depending on several factors like workload and the resources available. It’s not uncommon to see the coordinator alone consuming several tens of GBs of memory. 
    • The good news is there is memory information available in at least two places:
      • Presto’s built-in JMX catalog can help you monitor memory usage with various counters.  Read more about memory pools, limits and counters at https://prestodb.io/blog/2019/08/19/memory-tracking
      • There is also the Presto Console which reveals, for each query, the reserved, peak and cumulative memory usage.
  7. When was the last time you restarted your Presto cluster? Sometimes restarting any kind of software can solve all sorts of issues, including those caused by memory leaks and garbage collection.
  8. Is your Presto cluster configured for autoscaling based on CPU usage? If so, check the configuration is what you expect it to be.
  9. Does IO and CPU utilisation look balanced? Check CPU usage on Presto workers: if their CPUs are not fully saturated, it might indicate that the number of Presto worker threads could be increased, or that the number of splits in a batch is too low.
  10. Have you checked your data volumes recently? An obvious one to check, but data volumes can grow in fits and starts, and peaks sometimes occur unexpectedly. The queries may simply be taking longer because there is x% more data than last month.
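As an illustration of the memory rule of thumb in point 2, etc/jvm.config on a node with 64 GB of RAM might look like this (the -Xmx value and GC flags are illustrative assumptions to adapt, not recommendations):

```
-server
-Xmx51G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
```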

Other configuration settings for Task concurrency, initial splits per node, join strategy, driver tasks… PrestoDB has around 82 system configurations and 50+ hive configuration settings which users can tweak, many at the query level. These are however for advanced users, which falls outside the scope of this article. More information can be found in the PrestoDB documentation.

As you can tell, there’s a lot to configure and tune when it comes to addressing Presto performance issues. To make it easier, you can use Ahana Cloud, SaaS for Presto. It’s available in the AWS Marketplace and is pay as you go. Check out our free trial at https://ahana.io/sign-up

Extract Keys in a Nested JSON Array Object With Presto

Let’s say we have some JSON that looks like this:

[{"id": 1, "value":"xxx"}, {"id": 2, "value":"yyy"}]

In this instance our json contains key value pairs. How can we decode that using PrestoDB SQL to extract only the keys (1,2 in this example)?  

We can extract keys in a nested JSON array object using SQL arrays, the JSON_PARSE function and the TRANSFORM function.  The TRANSFORM function’s syntax is:

transform(array(T), function(T, U)) -> array(U)

Returns an array that is the result of applying the function to each element of the input array.

For this example query we’ll also use the VALUES function to provide the test data – VALUES is a handy way to define an inline  table – kind of like a dummy table.  

select TRANSFORM(CAST(JSON_PARSE(array) AS ARRAY<MAP<VARCHAR, VARCHAR>>), entry -> entry['id']) as My_Keys 
from (values ('[{"id": 1, "value":"xxx"}, {"id":2, "value":"yy"}]')) t(array);

 [1, 2]  

This works by first supplying the input JSON as a string in an inline table whose column is called ‘array’. We then use JSON_PARSE, which converts a string containing JSON text into deserialized JSON values, and we cast those values as an array of maps with varchar keys and values. TRANSFORM then returns an array containing only the elements we want (we call each element ‘entry’ in this example, but it could be called anything).
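The same key extraction is easy to mirror in plain Python, which can help when reasoning about what the SQL is doing (an illustrative sketch, not Presto code):

```python
import json

# json.loads plays the role of JSON_PARSE; the list comprehension
# plays the role of TRANSFORM(..., entry -> entry['id'])
payload = '[{"id": 1, "value":"xxx"}, {"id": 2, "value":"yyy"}]'
keys = [entry["id"] for entry in json.loads(payload)]
print(keys)  # [1, 2]
```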

So with Presto’s rich set of array and JSON functions we can see how to easily extract keys in a nested JSON array object. If you want to get up and running with Presto, Ahana Cloud’s cloud managed service built for AWS is the easiest way to do that. See our docs for more details.

Generate Interval from Variable Using PrestoDB

There are numerous ways to generate interval from variable using PrestoDB SQL for different data types.


Date intervals can be generated for a given date range using a combination of the following not-so-well-known SQL functions:

  • SEQUENCE() – generates a sequence of values between a given start and stop range. It generates an ARRAY type. Function syntax is:

sequence(start, stop, step) -> array(timestamp)

Generate a sequence of timestamps from start to stop, incrementing by step. The type of step can be INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. Using step we can choose the interval we need, e.g. INTERVAL '1' DAY.

Note that there is an array size upper limit of 10000.

  • FROM_ISO8601_DATE() – Parses an ISO 8601 formatted string into a date.
  • VALUES() – defines an inline / dummy table. You can use VALUES() anywhere a query can appear, e.g. as the FROM clause of a SELECT or in an INSERT.
  • UNNEST() – this function is used to expand an ARRAY into a single column. 
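To see how these pieces fit together before writing any SQL, here is a rough Python analogue of the SEQUENCE-then-UNNEST pattern (illustrative only; the function below mimics SEQUENCE with a day step, and iterating the resulting list plays the role of UNNEST):

```python
from datetime import date, timedelta

def day_sequence(start, stop, step_days=1):
    """Rough analogue of SEQUENCE(start, stop, INTERVAL 'n' DAY): an inclusive list of dates."""
    out, current = [], start
    while current <= stop:
        out.append(current)
        current += timedelta(days=step_days)
    return out

# UNNEST simply turns the array into rows; in Python we just iterate the list
dates = day_sequence(date(2020, 6, 1), date(2020, 6, 14))
print(len(dates))  # 14
```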

Example 1

Here’s an example query showing how to generate dates for the first 2 weeks of June with an interval of 1 day. We create an array with the sequence of dates we need called date_array, then convert the array into rows of values with a CROSS JOIN UNNEST.

SELECT CAST(date_column AS DATE) date_column 
FROM (VALUES (SEQUENCE(FROM_ISO8601_DATE('2020-06-01'),    
                   FROM_ISO8601_DATE('2020-06-14'), INTERVAL '1' DAY) ) ) AS t1(date_array) 
CROSS JOIN UNNEST(date_array) AS t2(date_column) ;

(14 rows)

Example 2

Here’s a slightly different approach this time using the DATE_ADD function. We specify a start date in a table and the result set begins with this given date plus the next n (10 in this example) dates. We’re still using an array but the array from the SEQUENCE function is fed straight to the UNNEST function.

create table table1(start_date date);

insert into table1 values date('2020-10-13');

select DATE_ADD('day',s.n,t.start_date) as date from table1 t cross join UNNEST(SEQUENCE(0,10)) s (n);
(11 rows)



Here’s a simple example generating a sequence of integers from 20 to 0 in steps of -2:

 SELECT x  FROM UNNEST(sequence(20, 0, -2)) t(x);
(11 rows)

These examples should help you when trying to generate an interval from a variable.

How to Presto Escape a Single Quote

There will be occasions when you need to use a quote character in your data, in your query, or in result sets, but you want the SQL interpreter to treat the quote as “just another character”, as opposed to a quote with special meaning – namely denoting a string literal, as in most SQL dialects. This is called escaping a character.

Some SQL dialects use a backslash as the escape character. Presto’s SQL needs no escape character; instead it uses a double character notation. Here’s a simple example of what I mean:

presto:default> select 'hello world';
 hello world

What if I want single quotes to actually appear in the result?  I need to do this in Presto:

presto:default> select '''hello world''' as col1;
 'hello world' 

I have simply used two single quotes to tell the interpreter the places where “I really want to include a single quote”. 

Another example: What if there are single quotes in my table? 

presto:default> select * from mytable where column_a in ('Driver's License');

Query n failed: line 1:43: mismatched input 'in'. Expecting: 'AND', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LIMIT', 'OR', 'ORDER', 'UNION', <EOF>

The query was rejected because the interpreter encountered an unexpected single quote in “Driver’s License”.  To handle this we use two single quotes:

presto:default> select * from mytable where column_a in ('Driver''s License');

(query runs ok)
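The doubling rule is mechanical, so if you build SQL strings in application code you can apply it with a tiny helper like this Python sketch (a hypothetical helper; for anything user-supplied, prefer parameterized queries over string building):

```python
def quote_literal(value: str) -> str:
    """Render a string as a Presto SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

print(quote_literal("Driver's License"))  # 'Driver''s License'
```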

So that’s how to escape a single quote in Presto.

Check if null and return default (NVL analog)

SQL has a number of functions for dealing with null values, and everyone has their favorite. The complicating factor is that not all SQL engines support all the functions, and some SQL dialects have their own versions of the functions which may be completely different (remember the “S” in SQL stands for Structured, not Standard!)

Let’s look at the functions used to substitute alternative, or default, values for NULLs. These functions are really useful if you need to see zeros instead of nulls in your results, or if you need to avoid division by zero errors.  We’ll check if null and return default (nvl analog).

Oracle has the NVL(column_x, value_y) function which will return column_x UNLESS it is a null, in which case the value_y will be returned instead. 

In Microsoft SQL Server, the function is ISNULL(column, value)

In MySQL, you use IFNULL(column, value)

In most databases (MySQL included), you can use COALESCE(value1, value2, …) which returns the first non-null value in the list.

And in some databases, you will find the function NULLIF(value1, value2) which returns null if value1 equals value2, otherwise returns value1.

And in most databases, you can use good old CASE, for example:

SELECT CASE WHEN a = 999 THEN 'this is too much'
            WHEN a is null THEN 'this is null'
            ELSE 'we are good'
       END

What about PrestoDB’s support when it comes to handling nulls?  PrestoDB  supports the following:

COALESCE(value1, value2, …) returns the first non-null value (or column) in the list.
NULLIF(value1, value2) returns null if value1 equals value2, otherwise returns value1.
CASE..WHEN..END – various forms supported; see https://prestodb.io/docs/current/functions/conditional.html for more information. Use with IS NULL to match null values. 
IF(condition, true_value, false_value) evaluates and returns true_value if condition is true, otherwise false_value. Use with IS NULL to match null values. 

You’ll notice NVL() is not supported by PrestoDB. Use COALESCE(column_x, value_y) instead as a direct replacement, or use IF() or a CASE expression. 
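The first-non-null behaviour of COALESCE is easy to picture as a short Python function (an illustrative analogue, not how Presto implements it):

```python
def coalesce(*values):
    """Return the first argument that is not None, like SQL's COALESCE; None if all are null."""
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, None, 'SINGLE'))    # SINGLE
print(coalesce('JUMBO CASE', 'SINGLE'))  # JUMBO CASE
```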

Some examples using PrestoDB:

--Check we have a null in our data
presto:default> select partkey, brand, container from parts_null_test where partkey=200001;
 partkey |   brand    | container 
  200001 | SuperBrand | NULL      

--Turn nulls into the string ‘Single’ using coalesce
presto:default> select partkey, brand, COALESCE(container,'SINGLE') as CONTAINER from parts_null_test where partkey=200001;
 partkey |   brand    | CONTAINER 
  200001 | SuperBrand | SINGLE    

--Use the IF() method to turn nulls into the string ‘Single’
presto:default> select partkey, brand, IF(container is null,'SINGLE', container) as CONTAINER from parts_null_test LIMIT 5;
 partkey |   brand    | CONTAINER  
  200001 | SuperBrand | SINGLE     
  100001 | Brand#35   | JUMBO CASE 
  100002 | Brand#34   | WRAP CAN   
  100003 | Brand#21   | SM BOX     
  100004 | Brand#42   | SM CASE    
--Use a CASE statement
presto:default> select partkey, brand, CASE WHEN container is null THEN 'SINGLE' ELSE container END as CONTAINER from parts_null_test LIMIT 5;
 partkey |   brand    | CONTAINER  
    7024 | Brand#52   | JUMBO PACK 
    7025 | Brand#15   | LG PACK    
    7026 | Brand#55   | LG BOX     
    7023 | Brand#22   | LG DRUM    
  200001 | SuperBrand | SINGLE     

The above information should help you check if null and return default (NVL analog) using COALESCE(), IF(), or CASE in your PrestoDB SQL queries.

Static date and timestamp in where clause

In this post we’ll look at using a static date and timestamp in a where clause when it comes to Presto. Many databases automatically convert between CHAR or VARCHAR and other types like DATE and TIMESTAMP as a convenience feature. Constants in a query are also often auto-converted. Take these example queries which count rows where the table’s DATE column is equal to a specific date, or range of times:

Select count(*) from myTable where myDateCol = '2020-09-28';

Select count(*) from myTable where myTimeCol between '2020-09-28 14:00:00.000' and '2020-09-28 14:59:59.000';

The underlying table has columns defined as DATE and TIMESTAMP types.  But the query is actually providing date and timestamp literals in the form of strings!  Some databases will tolerate this, but Presto is stricter in this respect and will give you an error like this:

presto:sf1 > select count(*) from orders where orderdate < '2020-09-01';
Query failed: line 1:45: '<' cannot be applied to date, varchar(10)

With Presto you must either cast your data types or a slightly simpler way is to use the date or timestamp type constructors:

$ presto --schema sf1 --catalog tpch

presto:sf1> select count(*) from orders where orderdate < date '2020-09-01';
(1 row)

> select count(*) from transactions where myTimestampCol between timestamp '2020-09-01 22:00:00.000' and timestamp '2020-09-01 22:59:59.000';
(1 row)

--Using CAST() also works:
presto:sf1> select count(*) from orders where orderdate < CAST('2020-09-01' as DATE);
(1 row)
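If your queries are generated from application code, one way to avoid the varchar-vs-date mismatch altogether is to render dates as typed literals up front, e.g. with a small hypothetical helper like this Python sketch:

```python
from datetime import date

def date_literal(d: date) -> str:
    """Render a Python date as a Presto DATE literal, e.g. date '2020-09-01'."""
    return f"date '{d.isoformat()}'"

print(date_literal(date(2020, 9, 1)))  # date '2020-09-01'
```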

In summary, by using CAST() or the DATE/TIMESTAMP type constructors you will be able to specify a static date and timestamp in where clause.

Where can I find the Presto Server Bootstrap logs?

A common question is “where can I find the Presto server bootstrap logs?”  These are logs that indicate what is happening to the cluster on startup. The Presto server generates them, and the content looks like this:

2020-05-22T11:59:32.045-0400 INFO main io.airlift.log.Logging Logging to stderr
2020-05-22T11:59:32.049-0400 INFO main Bootstrap Loading configuration
2020-05-22T11:59:32.224-0400 INFO main Bootstrap Initializing logging
2020-05-22T11:59:33.268-0400 INFO main Bootstrap PROPERTY DEFAULT RUNTIME DESCRIPTION
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.uri null http://localhost:8080 Discovery service base URI
2020-05-22T11:59:33.268-0400 INFO main Bootstrap service-inventory.uri null null Service inventory base URI
2020-05-22T11:59:33.268-0400 INFO main Bootstrap service-inventory.update-interval 10.00s 10.00s Service inventory update interval
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.presto.pool general general
2020-05-22T11:59:33.268-0400 INFO main Bootstrap discovery.collector.pool general general
2020-05-22T11:59:33.270-0400 INFO main Bootstrap discovery.max-age 30.00s 30.00s
2020-05-22T11:59:33.270-0400 INFO main Bootstrap discovery.store-cache-ttl 1.00s 1.00s
2020-05-22T11:59:33.270-0400 INFO main Bootstrap dynamic.store.gc-interval 1.00h 1.00h
2020-05-22T11:59:33.271-0400 INFO main Bootstrap dynamic.store.remote.max-batch-size 1000 1000

In the above excerpt you can see Bootstrap-specific lines.

There are 3 logs of interest:

server.log – This is the main log file used by Presto. If the server fails during initialization, it usually contains relevant information. It will automatically rotate and compress.
launcher.log – This log is created by the launcher and is connected to the server’s stdout and stderr streams. It will contain some log messages that occurred while initializing server logging and any error or diagnostic information generated by the JVM.
http-request.log – This is the HTTP request log, which contains every HTTP request received by the server. It will automatically rotate and compress.

The log files’ location depends on where you are running your PrestoDB cluster and how you have it configured:

  • If you are running on EC2 look here: /media/ephemeral0/presto/var/log/bootstrap.log 
  • Other places to check (not just for the server log) are: /var/log/ and /var/lib/presto/data/var/log/ and /var/presto/data/

If all else fails, look in your config.properties file (look in /opt/presto-server/etc/) to see if there is anything deployment-specific configured for logging. 

Configuration tip: Presto needs a data directory for storing logs, etc. and it is recommended this is created in a data directory outside of the installation directory (which allows it to be easily preserved when upgrading Presto). 

  • Presto’s node properties file etc/node.properties sets the location of the data directory. 
  • In your node.properties file set the node.data-dir property to point to the location (filesystem path) of the intended data directory – Presto will store logs and other data here. 

Finally, if you are wondering where to start when diagnosing bootstrap issues, it is worth looking at the output in the bootstrap section of the server.log file first. 
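Since the bootstrap lines are tagged with the Bootstrap logger name, you can pull just those lines out of server.log with a trivial filter; here is a Python sketch (the log file’s actual path depends on your node.data-dir setting):

```python
def bootstrap_lines(log_text: str):
    """Return only the Bootstrap-tagged lines from Presto server log text."""
    return [line for line in log_text.splitlines() if ' Bootstrap ' in line]

# Tiny sample in the format shown earlier in this article
sample = (
    "2020-05-22T11:59:32.045-0400 INFO main io.airlift.log.Logging Logging to stderr\n"
    "2020-05-22T11:59:32.049-0400 INFO main Bootstrap Loading configuration\n"
)
print(bootstrap_lines(sample))  # only the 'Bootstrap Loading configuration' line
```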

Armed with the above information you should now be able to find the Presto Server Bootstrap logs.

Where can I find different Presto metrics for monitoring?

If you’re wondering “Where can I find different Presto metrics for monitoring?”, we’ll help explain it. There are several ways to monitor Presto. Let’s look at some options.

1. Presto Console

Presto provides a web interface for monitoring and managing queries. The web interface is accessible on the Presto coordinator via HTTP, using the HTTP port number specified in the coordinator Config Properties (default is 8080). The console’s main page looks like this:

This main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. The currently running queries are at the top of the page, followed by the most recently completed or failed queries.  A query can have one of several states:

  • QUEUED – Query has been accepted and is awaiting execution.
  • PLANNING – Query is being planned.
  • STARTING – Query execution is being started.
  • RUNNING – Query has at least one running task.
  • BLOCKED – Query is blocked and is waiting for resources (buffer space, memory, splits, etc.).
  • FINISHING – Query is finishing (e.g. commit for autocommit queries).
  • FINISHED – Query has finished executing and all output has been consumed.
  • FAILED – Query execution failed.

The following console screenshot shows an example of an executed Presto query; here’s a breakdown of the stats:

  • The query was run by the root user, using the presto-cli tool. 
  • 54 splits were completed during execution.
  • It took 5.17 seconds (wall clock time) to run. 
  • The query consumed 19.7 secs of CPU time – this is greater than the wall clock time since there are multiple CPUs and multiple cores at work.  
  • In terms of (JVM) memory, the query used up to 59.2MB during execution.  
  • The “Cumulative User Memory” (35.5M) is the sum of all user memory consumption across all query stages, from queryStats. The unit of this metric is memory multiplied by time (MB-seconds): the total memory the query consumed over its execution, excluding the system’s memory consumption:

Click on the query ID link (20200925 in this example) and you will see a LOT more detail:

Notice the “Live Plan” tab at top-right which will give you a graphical representation of the query plan, which you read from the bottom up:

The plan is clickable – click a stage and you can drill down into more detail. 
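As an aside, the “Cumulative User Memory” stat above is just memory use integrated over time; here is a toy Python illustration with made-up numbers:

```python
# Hypothetical samples: (seconds spent in interval, MB in use during that interval)
samples = [(1.0, 10.0), (2.0, 8.0), (2.17, 4.0)]

# Memory integrated over time gives MB-seconds, the unit of Cumulative User Memory
cumulative_mb_seconds = sum(seconds * mb for seconds, mb in samples)
print(round(cumulative_mb_seconds, 2))  # 34.68
```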

2. JMX

Java Management Extensions (JMX) provides information about the Java Virtual Machine and all of the software running inside it. 

In Presto, JMX is exposed via a connector: chosen JMX information is periodically dumped and stored in tables (in the “jmx” catalog) which can be queried. Note this can be controlled in the properties file: /etc/catalog/jmx.properties

JMX is useful for monitoring and debugging Presto. 

Here’s how to query it using presto-cli:

$ presto --schema jmx --catalog jmx
> select * from jmx.information_schema.tables;   -- lists tables in the information schema

> show tables from jmx.information_schema; 	    -- another way of listing tables

> select * from jmx.information_schema.views;    -- lists all views 

The most useful JMX schema for monitoring is “current” which contains every MBean from every node in the Presto cluster.  The following query uses the “current” schema to return information from the different Presto memory pools on each node:

> SELECT freebytes, node, object_name FROM jmx.current."com.facebook.presto.memory:*type=memorypool*";

 freebytes |     node     |                       object_name                        
 322122547 | 4768d52a6258 | com.facebook.presto.memory:type=MemoryPool,name=reserved 
 429496730 | 4768d52a6258 | com.facebook.presto.memory:type=MemoryPool,name=general  
(2 rows)

More info on monitoring can be found in the docs – check out this blog for details on memory-related monitoring, management and memory pools: https://prestodb.io/blog/2019/08/19/memory-tracking 


3. Presto REST API

You can make a simple REST call to Presto via the Presto REST API to get a dump of recently run queries using:

curl http://<coordinator>:8080/v1/query
The response is in JSON format.

You can optionally specify a query ID – in this example my query ID is 20200926_204458_00000_68x9u:

curl http://<coordinator>:8080/v1/query/20200926_204458_00000_68x9u

4. Third-Party Tools

You can also monitor Presto using third-party tools like Datadog, Prometheus, etc.

The above is not an exhaustive list, but I hope it helps you find different Presto metrics for monitoring.

Where to Find Presto Logs

If you’re wondering “where do I find Presto logs”, we’ll help explain it. Presto needs a data directory for storing logs, etc. and it is recommended this is created in a data directory outside of the installation directory (which allows it to be easily preserved when upgrading Presto). Presto’s node properties file etc/node.properties sets the location of the data directory. Set the node.data-dir property to point to the location (filesystem path) of the intended data directory – Presto will store logs and other data here. 

For example, my environment is set to log to /var/lib/presto/data/var/log and in here we find http request and other logs: 

$ ls -al /var/lib/presto/data/var/log
-rw-r--r-- 1 root root 5202277 Sep 25 21:53 http-request.log
-rw-r--r-- 1 root root  238037 Sep 12 01:23 http-request.log-2020-09-11.0.log.gz
-rw-r--r-- 1 root root  636367 Sep 13 00:00 http-request.log-2020-09-12.0.log.gz
-rw-r--r-- 1 root root  583609 Sep 14 00:00 http-request.log-2020-09-13.0.log.gz
-rw-r--r-- 1 root root  297803 Sep 16 12:53 http-request.log-2020-09-14.0.log.gz
-rw-r--r-- 1 root root  225971 Sep 17 00:00 http-request.log-2020-09-16.0.log.gz
-rw-r--r-- 1 root root  221244 Sep 25 10:43 http-request.log-2020-09-17.0.log.gz

Note that logging levels are controlled by the etc/log.properties file. The level can be set to one of four levels: DEBUG, INFO, WARN and ERROR. 
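For example, a minimal etc/log.properties that sets the default level for the Presto code base looks like this (the property name is the Java package prefix to configure):

```
com.facebook.presto=INFO
```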

Query Logs: Running/completed/failed queries are displayed in the Presto Console UI. You can also access query logs using the Presto Event Listener in combination with custom functions to listen to events happening inside the Presto engine and react to them. Event listeners are invoked for the following events in the Presto query workflow: 1) query creation, 2) query completion, and 3) split completion. There’s a worked example at http://dharmeshkakadia.github.io/presto-event-listener/ and the doc page is at https://prestodb.io/docs/current/develop/event-listener.html. This method enables you to collect all the queries submitted to Presto for later analysis. 

If you are using Ahana Cloud then it exposes the query log in a catalog that can be easily queried, e.g. using presto-cli for convenience.  The catalog is called ahana_querylog and it uses the aforementioned Event Listener mechanism:

$ presto --server https://devtest.james.staging.ahn-dev.app 
presto:demo> show catalogs;
(9 rows)

presto:public> use ahana_querylog.public;
presto:public> show tables;

(2 rows)

presto:public> select * from querylog;
      type      | cluster_name |           ts            | seq |    user     |                               
 queryCreated   | devtest      | 2020-09-22 13:03:32.000 |   1 | jamesmesney | show catalogs                 
 queryCreated   | devtest      | 2020-09-22 13:03:54.000 |   3 | jamesmesney | use ahana_querylog            
 queryCreated   | devtest      | 2020-09-22 13:04:15.000 |   4 | jamesmesney | use ahana_hive                
 queryCreated   | devtest      | 2020-09-22 13:06:28.000 |   5 | jamesmesney | SHOW FUNCTIONS                
 queryCreated   | devtest      | 2020-09-22 13:15:13.000 |   8 | jamesmesney | show catalogs                 
 queryCreated   | devtest      | 2020-09-22 13:15:19.000 |  10 | jamesmesney | use ahana_hive                
 queryCompleted | devtest      | 2020-09-22 13:15:19.000 |  11 | jamesmesney | use ahana_hive                
 queryCreated   | devtest      | 2020-09-22 13:15:20.000 |  13 | jamesmesney | SELECT table_name 
 queryCreated   | devtest      | 2020-09-22 13:15:24.000 |  15 | jamesmesney | show tables                   
 queryCreated   | devtest      | 2020-09-22 13:15:40.000 |  16 | jamesmesney | SHOW FUNCTIONS                
 queryCreated   | devtest      | 2020-09-22 13:15:44.000 |  21 | jamesmesney | show tables                   
 queryCompleted | devtest      | 2020-09-22 13:15:55.000 |  25 | jamesmesney | use ahana_querylog            
 queryCreated   | devtest      | 2020-09-22 13:15:20.000 |  12 | jamesmesney | SHOW FUNCTIONS                
 splitCompleted | devtest      | 2020-09-22 13:15:44.000 |  22 | NULL        | NULL                          
 queryCreated   | devtest      | 2020-09-22 13:15:55.000 |  24 | jamesmesney | use ahana_querylog     

So now you know where and how to find Presto logs. For more information on all-things Presto check out our past presentations.

Presto equivalent of mysql group_concat

PrestoDB supports ANSI SQL and includes many functions compatible with other dialects, including MySQL, making it easy to group and aggregate data in a variety of ways. However not ALL functions in MySQL are supported by PrestoDB! 

Let’s look at the really useful MySQL and MariaDB SQL function GROUP_CONCAT() – this is used to concatenate data in column(s) from multiple rows into one field. It is an aggregate (GROUP BY) function which returns a String, assuming the group contains at least one non-NULL value (otherwise it returns NULL). GROUP_CONCAT() is an example of a function that is not yet supported by PrestoDB and this is the error you’ll see if you try using it to get a list of customers that have ordered something along with their order priorities:

presto> use tpch.sf1;

presto:sf1> select custkey, GROUP_CONCAT(DISTINCT orderpriority ORDER BY orderpriority SEPARATOR ',') as OrderPriorities from orders GROUP BY custkey;

Query 20200925_105949_00013_68x9u failed: line 1:16: Function group_concat not registered

So what’s the workaround?  array_join() and array_agg() to the rescue! 

presto:sf1> select custkey,array_join(array_distinct(array_agg(orderpriority)),',') as OrderPriorities from orders group by custkey;
 custkey |                OrderPriorities                 
  111874 | 5-LOW,1-URGENT,2-HIGH,4-NOT SPECIFIED          
(only the first 5 rows displayed) 

If you do not want to use the DISTINCT operator (you want duplicates in your result set in other words) then simply drop the array_distinct() function from your query:

presto:sf1> select custkey,array_join(array_agg(orderpriority),',') as OrderPriorities from orders group by custkey;
 custkey | OrderPriorities                             
   94169 | 1-URGENT,4-NOT SPECIFIED,4-NOT SPECIFIED,1-URGENT,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,3-MEDIUM,4-NOT SPECIFIED,5-LOW,4-NOT SPECIFIED,2-HIGH,5-LOW,4-NOT SPECIFIED                                                                                                                                                                                                        

You can of course specify the separator character – in the above example I’ve used a comma.
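The array_agg / array_distinct / array_join pipeline maps directly onto list operations; here’s a Python analogue of the behaviour for a single group’s values (illustrative only, not how Presto implements aggregation):

```python
def group_concat(values, distinct=False, separator=','):
    """Mimic array_join(array_distinct(array_agg(col)), sep) for one group's values."""
    if distinct:
        seen, deduped = set(), []
        for v in values:  # array_distinct keeps the first-occurrence order
            if v not in seen:
                seen.add(v)
                deduped.append(v)
        values = deduped
    return separator.join(values)

priorities = ['1-URGENT', '4-NOT SPECIFIED', '1-URGENT', '5-LOW']
print(group_concat(priorities, distinct=True))  # 1-URGENT,4-NOT SPECIFIED,5-LOW
```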

It’s worth noting that, like PrestoDB, Microsoft SQL Server historically had no T-SQL equivalent of the MySQL GROUP_CONCAT() function either. However, T-SQL now has the STRING_AGG() function, which is available from SQL Server 2017 onwards.

And hey, presto, you now have a working Presto equivalent of MySQL’s GROUP_CONCAT().

How to Get the Presto CLI Tool

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal executable. Here are the steps:

  1. Choose one of three ways to install it: 
    • If you use Docker, install the Ahana sandbox in Docker
    • If you have a Mac and use brew, simply run “brew install presto”
    • Download only the presto-cli tar file from https://prestodb.io/download.html (this will work on any OS that has java installed) 

Note that the first two options will install PrestoDB server as well.  All three options will give you the presto-cli executable jar client that you can use to connect to any Presto cluster – including Ahana Cloud for Presto of course.

2. Next check your Ahana Cloud console for the Presto cluster endpoint URL. You will find it in the Cluster Connection details:

3. And now connect using that URL. Here’s an example that connects you to Ahana’s integrated querylog.

MacBook$ presto --server https://MyClusterName.app --schema public --catalog ahana_querylog
presto:public> select * from querylog;

      type      | cluster_name |           ts            | seq |    user     | query 
 queryCreated   | devtest      | 2020-09-22 13:03:32.000 |   1 | jamesmesney | show catalogs                                                                                                          
 queryCreated   | devtest      | 2020-09-22 13:03:54.000 |   3 | jamesmesney | use ahana_querylog                                                                                                     
 queryCreated   | devtest      | 2020-09-22 13:04:15.000 |   4 | jamesmesney | use ahana_hive                                                                                                         
 queryCreated   | devtest      | 2020-09-22 13:06:28.000 |   5 | jamesmesney | SHOW FUNCTIONS                                                                                                         
 queryCreated   | devtest      | 2020-09-22 13:15:13.000 |   8 | jamesmesney | show catalogs                                                                                                          
 queryCreated   | devtest      | 2020-09-22 13:15:19.000 |  10 | jamesmesney | use ahana_hive                                                                                                         
 queryCompleted | devtest      | 2020-09-22 13:15:19.000 |  11 | jamesmesney | use ahana_hive                                                                                                         
 queryCreated   | devtest      | 2020-09-22 13:15:20.000 |  13 | jamesmesney | SELECT table_name FROM information_schema.tables WHERE table_schema = 'default'                                        

One more example to get you started with the integrated tpch data:

MacBook$ presto --schema sf1 --catalog tpch

presto:sf1> show tables;











(8 rows)

presto:sf1> select count(*) from orders where orderdate < date '2020-09-01';




(1 row)

If you have a Windows machine and have downloaded the presto-cli jar file from https://prestodb.io/download.html and are having trouble running it, try this (replace ‘nnn’ with your version):

c:\> java -jar presto-cli-0.nnn-executable.jar --server <host:port> --catalog <catalog_name> --schema <schema_name> --user <user_name> --password

How To Convert Date String to Presto Date Format

PrestoDB supports ANSI SQL and also includes many MySQL-compatible date functions, so it’s straightforward to convert a date string to date format and vice-versa in various formats. Below we’ll discuss how to handle Presto date functions.

If you have ISO 8601 format dates or timestamps like “2020-09-16 14:27:00” it’s very straightforward – use the date() function. Let’s test a Presto insert:

presto:default> create table datetest1 (s1 varchar);
presto:default> insert into datetest1 values ('2020-09-16');
INSERT: 1 row
presto:default> select date(s1) as myDate from datetest1;

An alternative is to use the Presto cast(x as date) – it’s exactly the same:

presto:default> select CAST(s1 AS date) as myDate from datetest1;

Another alternative is to use the from_iso8601_date() function:

presto:default> select from_iso8601_date(s1) as myDate from datetest1;

>> Looking for an easy way to get up and running with Presto? Take a look at Ahana Cloud. <<

Ok, but what if my dates are not ISO 8601? Like “16/09/2020”, which is common in Europe and APAC? Let’s set up some string-based test data again:

presto:default> create table datetest2 (s1 varchar);
presto:default> insert into datetest2 values ('16/09/2020');
INSERT: 1 row

Did you know PrestoDB supports MySQL’s date format specifiers? Let’s convert our string to a timestamp using the MySQL-style date_parse() function, which returns a timestamp type:

presto:default> select date_parse(s1, '%d/%c/%Y') as myTimestamp from datetest2;
 2020-09-16 00:00:00.000 

That works. And if we want that as a date instead of a timestamp, no problem just cast it:

presto:default> select cast(date_parse(s1, '%d/%c/%Y') as date) as myDate from datetest2;

Date_parse() is powerful: for example you can use %c for non-zero-padded month numbers, %e for non-zero-padded days of the month, and it can handle two- or four-digit years. If your separators are “.” instead of “/” it can handle that. And if you have timestamps with “AM” or “PM” present then its ‘%p’ specifier can handle that too. 
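If you want to sanity-check a format before translating it into date_parse() specifiers, Python’s strptime offers a similar mini-language (the specifier letters differ between the two; this is an analogue, not Presto code):

```python
from datetime import datetime

# '16/09/2020' parsed day-first, analogous to date_parse(s1, '%d/%c/%Y') in Presto
d = datetime.strptime('16/09/2020', '%d/%m/%Y').date()
print(d.isoformat())  # 2020-09-16
```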

So that’s how you convert date string to date format! 

If you’re looking for a more comprehensive dive into operating Presto, download our free ebook: Learning and Operating Presto

Want more Presto tips & tricks in your inbox? Sign up for our Presto community newsletter.

Common Questions/Errors:

Why am I getting the following error: ‘>’ cannot be applied to date, varchar (10)

Answer: You are trying to compare a string literal in your query to a date type in your schema. The operator in the error can be any operator >,=,<,>=,<=. To fix that, you need to cast the string to a date type or use a function to parse the string to date.

Example error: select now() as time_now where current_date = '2021-06-01'

Example solution:

Option 1: Date literal (type constructor) 
select now() as time_now where current_date > date '2021-06-01'

Option 2: Cast Explicit
select now() as time_now where current_date > Cast ( '2021-06-01' as Date)

Option 3: Parse iso8601 date
select now() as time_now where current_date > from_iso8601_date('2021-06-01')

Option 4: MySQL parser, Non Standard Date format using date_parse 
select now() as time_now where current_date > date_parse('01-06-2021','%d-%m-%Y')

Option 5: Java DateTime parser, Non Standard Date format using parse_datetime 
select now() as time_now where current_date > parse_datetime('01-06-2021','dd-MM-YYYY')

How do I convert string with timestamp to a date?

Answer: You are trying to accurately convert a string of the format such as ‘2021-05-13T19:02:57.415006+01:00’ to date and use it in your query for comparison. You need to parse the string using either date_parse or parse_datetime

Example problem: Need to convert the following timestamp to date: ‘2021-05-13T19:02:57.415006+01:00’

Example solution:

Option 1: MySQL parser, Non Standard Date format using date_parse 
select date_parse('2021-05-13T19:02:57.415006','%Y-%m-%dT%H:%i:%s.%f') AT TIME ZONE '+01:00'

Option 2: Java DateTime format parser, Non Standard Date format using parse_datetime 
select parse_datetime('2021-05-13T19:02:57.415006+01:00','YYYY-MM-dd''T''HH:mm:ss.SSSSSSZ') 
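For comparison, the same offset-bearing string parses in one step with Python’s strptime, whose %z specifier accepts the +01:00 offset (an illustrative analogue, not Presto code):

```python
from datetime import datetime

# Parse the full timestamp, fractional seconds and UTC offset included
ts = datetime.strptime('2021-05-13T19:02:57.415006+01:00', '%Y-%m-%dT%H:%M:%S.%f%z')
print(ts.date().isoformat())  # 2021-05-13
```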

Want more Presto tips & tricks? Sign up for our Presto community newsletter.

Where to Find Presto Source Code & How to Work With It

The main branch: PrestoDB source code

Presto is an open source project that is developed in the open on the public GitHub repository: https://github.com/prestodb/presto. The prestodb repo is the original, master repo from when Presto was first developed at Facebook; Facebook subsequently open sourced the code base on GitHub in 2013 under the Apache 2.0 license, a permissive license which lets anyone download and use the code. 

Whether you call it master, trunk, upstream, or mainline, those all refer to the prestodb repo: https://github.com/prestodb/presto. It is the single, shared, current state of the software project. Many wonder how to work with Presto. Whenever you wish to start a new piece of work, you pull code from this origin into your local repository and begin coding. The master repo is the single shared codeline and represents the central repo and single point of record for the project. Most Presto clones or forks come from prestodb. 

It is worth noting that the Facebook team continues to develop and run the prestodb project in production at Facebook at scale, so the community benefits from the development and testing done by Facebook and the other companies that run Presto. There are numerous other contributors of course, many of them working within organisations that have deployed PrestoDB at scale.

What about other forks?

As an open source project, community members can download the source code and work on it in their own public or private repos. Those members can decide to contribute changes back through the traditional Github development process of pull requests, reviews, and commits. If you are starting anew, it is generally recommended to always pull source code from the master, mainline prestodb open source repo: https://github.com/prestodb/presto

Some members decide not to contribute back, and that Presto version then becomes a fork of the code. There are always a number of forks out in the community, and as time goes on their development tends to diverge from the original codeline. A fork misses out on the upstream changes and testing that companies like Facebook and others do, unless it is merged back with upstream.

(Diagram: mainline and diverging branch patterns. Source: https://martinfowler.com/articles/branching-patterns.html)

What about PrestoSQL source code?

PrestoSQL is a fork of the original Presto project. It has a separate github repository here: https://github.com/prestosql/presto

PrestoDB is the main project of the Linux Foundation’s Presto Foundation. That Foundation has a wide-ranging set of industry members, including Facebook, Uber, Twitter, and Alibaba; Starburst has also joined as an industry member. As such, PrestoDB is the project repo of Presto, both today and in the future.

How can I work with the Presto source code?

The easiest way is through your own Github account. If you don’t yet have a Github account, it is free and easy to sign up. Then go to https://github.com/prestodb/presto and fork the repository into your Github account. If you’d like to work off your laptop, you can use a wide variety of free tools, like Atom or GitHub Desktop. Now you can get coding and compiling! Feel free to join the PrestoDB Slack channel to ask questions of the community (or answer questions too!). There is also the Presto Users Google group at https://groups.google.com/g/presto-users and Presto developers are active on Stack Overflow: https://stackoverflow.com/questions/tagged/presto
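The fork-and-clone setup can be sketched with plain git. The commands below use a local bare repository as a stand-in for the mainline repo so they run anywhere without network access; against the real project you would clone your GitHub fork as origin and add https://github.com/prestodb/presto as the upstream remote (all paths here are illustrative):

```shell
# Local stand-in for the mainline repo; in real use, origin is your
# GitHub fork and upstream is https://github.com/prestodb/presto.
mkdir -p /tmp/presto-clone-demo && cd /tmp/presto-clone-demo
git init -q --bare upstream.git          # plays the role of the prestodb repo
git clone -q upstream.git presto         # your working copy
cd presto
git remote add upstream ../upstream.git  # track the mainline codeline
git fetch -q upstream                    # sync with upstream before new work
git remote -v                            # lists both origin and upstream
```

Keeping an upstream remote lets you pull in mainline changes regularly, which is exactly what keeps a long-lived fork from diverging.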

How can I contribute back to the PrestoDB source?

When you’re ready to contribute your code back to the community, you can use Github to raise a pull request and get your code reviewed. You’ll also need to e-sign the Contributor License Agreement (CLA) that is part of all Linux Foundation projects.
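The pull-request flow can be sketched as follows. This runs on a throwaway local repo so it works without network access; with a real fork you would push the topic branch to GitHub and open the pull request there (the branch and file names are illustrative):

```shell
mkdir -p /tmp/presto-pr-demo && cd /tmp/presto-pr-demo
git init -q
git config user.email you@example.com && git config user.name "You"
git checkout -q -b fix-docs-typo            # do the work on a topic branch
echo "small improvement" > CHANGES.txt      # stand-in for your actual edit
git add CHANGES.txt
git commit -q -m "Fix typo in docs"         # commit with a clear message
# git push origin fix-docs-typo             # then raise the pull request on GitHub
git log --oneline -1
```

A focused topic branch with a descriptive commit message makes the review step much easier for maintainers.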

Note that there are other ways to contribute back to the community. The code base is wide and varied: there are opportunities to write or improve Presto connectors, or other parts of the Presto SQL engine. In addition to writing code, you could write documentation.

Wrapping Up

Now you know how to find the Presto source code and understand the different forks that are out there. Hope to see you in the community!