Presto Speed

Presto is an open source, distributed ANSI SQL query engine for analytics. Presto separates compute from storage: it queries data where it is stored externally, for example in Amazon S3 or in an RDBMS. Because efficiency and speed are central to query performance, Presto has a number of design features to maximize speed, such as in-memory pipelined execution, a distributed scale-out architecture, and a massively parallel processing (MPP) design. In terms of specific performance features, Presto supports:

  • Compression (SNAPPY, LZ4, ZSTD, and GZIP)
  • Partitioning
  • Table statistics, collected by the ANALYZE command and stored in a Hive or Glue metastore, which give Presto’s query planner insight into the shape, size and type of the data being queried. The planner also takes into account whether the data source supports pushdown operations such as filters and aggregates.
  • A cost-based optimizer, which depends on collected table statistics to function optimally.
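
As a minimal sketch of how statistics are collected and inspected, the Python snippet below builds the two Presto statements involved, ANALYZE and SHOW STATS FOR. The table name hive.sales.orders is hypothetical, and the snippet only prints the statements; submitting them is left to whichever Presto client you use.

```python
from typing import List


def stats_statements(table: str) -> List[str]:
    """Return the statements that collect, then inspect, statistics for one table."""
    return [
        f"ANALYZE {table}",         # collect statistics and store them in the metastore
        f"SHOW STATS FOR {table}",  # display the statistics available to the optimizer
    ]


# "hive.sales.orders" is a hypothetical catalog.schema.table name.
for statement in stats_statements("hive.sales.orders"):
    print(statement)
```

Running ANALYZE again after large data loads keeps the statistics, and therefore the optimizer's cost estimates, current.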

As Presto is an in-memory query engine, it can only process data as fast as the storage layer can provide it. Presto can query many different types of storage, some faster than others, so choosing the fastest available data source will boost Presto’s speed. This may involve tuning the source to reduce latency, increase throughput, or both. It may mean switching from a data source that is busy serving many users, and therefore suffers high contention, to an alternative such as a read replica of a database. It may also mean creating indexes, querying a pre-aggregated version of the data, or moving frequently used portions of data from object storage to a faster storage layer such as an RDBMS in order to meet a strict query SLA. Other suggestions include switching to one of Presto’s supported file formats that feature performance optimizations, such as ORC or Parquet, and enabling compression.
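
One way to act on the file format suggestion is a CREATE TABLE AS statement that rewrites an existing table as ORC or Parquet with partitioning. The sketch below, in Python, only builds such a statement. It assumes the Hive connector, whose `format` and `partitioned_by` table properties it uses. The table and column names are hypothetical, and the helper `ctas_to_columnar` is illustrative rather than part of Presto.

```python
from typing import List


def ctas_to_columnar(source: str, target: str, columns: List[str],
                     partition_by: str, file_format: str = "ORC") -> str:
    """Build a CTAS statement that copies `source` into a partitioned columnar table."""
    if file_format not in ("ORC", "PARQUET"):
        raise ValueError("file_format must be 'ORC' or 'PARQUET'")
    if partition_by not in columns:
        raise ValueError("partition_by must be one of the selected columns")
    # The Hive connector expects partition columns to come last in the column list.
    ordered = [c for c in columns if c != partition_by] + [partition_by]
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = '{file_format}', partitioned_by = ARRAY['{partition_by}'])\n"
        f"AS SELECT {', '.join(ordered)} FROM {source}"
    )


# All table and column names here are hypothetical.
print(ctas_to_columnar(
    source="hive.sales.orders_csv",
    target="hive.sales.orders_orc",
    columns=["order_date", "order_id", "total"],
    partition_by="order_date",
))
```

Queries that filter on the partition column can then skip whole partitions, while the columnar format lets Presto read only the columns a query needs.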

To assist with speed tests, Presto has TPC-DS and TPC-H catalogs built in, which generate data for benchmarking purposes at varying scales. For example, the 1TB TPC-H dataset consists of approximately 8.66 billion records in 8 tables.
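
The 8.66 billion figure can be checked with a little arithmetic. The sketch below uses the nominal per-table row counts from the TPC-H specification and assumes the 1TB dataset corresponds to scale factor 1000. The generated lineitem table differs slightly from its nominal count, so the total is approximate.

```python
# Nominal TPC-H rows per table at scale factor 1 (tables that grow with scale).
ROWS_PER_SCALE_FACTOR = {
    "supplier": 10_000,
    "part": 200_000,
    "partsupp": 800_000,
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,  # nominal; the generated count differs slightly
}
# Tables whose size does not depend on the scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def tpch_row_counts(scale_factor: int) -> dict:
    """Return approximate row counts per TPC-H table for the given scale factor."""
    counts = {table: rows * scale_factor
              for table, rows in ROWS_PER_SCALE_FACTOR.items()}
    counts.update(FIXED_ROWS)
    return counts


counts = tpch_row_counts(1000)  # scale factor 1000, assumed to be the 1TB dataset
total = sum(counts.values())
print(len(counts), "tables,", f"{total / 1e9:.2f} billion rows")
# prints: 8 tables, 8.66 billion rows
```

Smaller scale factors use the same arithmetic, which helps when picking a dataset size that matches the cluster being tested.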

Need more speed? Deploy Presto in the cloud and then scale your cluster to as many or as few instances as you need, when you need them. This “elasticity” is being made increasingly automatic by Ahana’s fully managed cloud-native Presto service.