Snowflake vs Presto
This article touches on several basic elements to compare Presto and Snowflake.
To start, let’s define what each of these is. Presto is an open-source SQL query engine for data lakehouse analytics. It’s well known for ad hoc analytics on your data. One important thing to note is that Presto is not a database. You can’t store data in Presto but use it as a compute engine for your data lakehouse. You can use presto on not just the public cloud but as well as on private cloud infrastructures (on-premises or hosted).
Snowflake is a cloud data warehouse that offers a cloud-based data storage and analytics service. Snowflake runs completely on cloud infrastructure. Snowflake uses virtual compute instances for its compute needs and storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).
Use cases: Snowflake vs. Presto
Snowflake is a cloud solution for your traditional data warehouse workloads such as reporting and dashboards. It is good for small-scale workloads; to move traditional batch-based reporting and dashboard-based analytics to the cloud. I discuss this limitation in the Scalability and Concurrency topic.
Presto is not only a solution for reporting & dashboarding. With its connectors and their in-place execution, platform teams can quickly provide access to datasets that analysts have an interest in. Presto can also run queries in seconds. You can aggregate terabytes of data across multiple data sources and run efficient ETL queries. With Presto, users can query data across many different data sources including databases, data lakes, and data lakehouses.
Open Source Or Vendor lock-in
Snowflake is not Open Source Software. Data that has been aggregated and moved into Snowflake is in a proprietary format only available to Snowflake users. Surrendering all your data to the Snowflake data cloud model is the ideal recipe for vendor lock-in.
Vendor Lock-In can lead to:
- Excessive cost as you grow your data warehouse
- When ingested into another system, data is typically locked into the formats of a closed source system
- No community innovations or ways to leverage other innovative technologies and services to process that same data
Presto is an Open Source project, under the Apache 2.0 license, hosted by the Linux Foundation. Presto benefits from community innovation. An open-source project like Presto has many contributions from engineers across Twitter, Uber, Facebook, Bytedance, Ahana, and many more. Dedicated Ahana engineers are working on the new PrestoDB C++ execution engine aiming to bring high-performance data analytics to the Presto ecosystem.
Open File Formats
Snowflake has chosen to use a micro-partition file format that is good for performance but closed source. The Snowflake engine cannot work directly with common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats to an internal Snowflake file format, but users miss out on performance optimizations that these open formats can bring to the engine, including dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes, avoid many small files, avoid few huge files, etc.
On the other hand, Presto users can run ad-hoc, real-time analytics, with deep learning, on those same source files previously mentioned, without needing to copy files, so there’s more flexibility that users get with this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without the need for an expensive migration.
Open transaction format
Many organizations are adopting Data Lakehouse architecture and augmenting their current data warehouse. This brings the need for a transaction manager layer that can be supported by Apache Hudi, Apache Iceberg, or Delta Lake. Snowflake does not support all of these table formats. Presto supports all these table formats natively, allowing users more flexibility and choice. With ACID transaction support from these table formats, Presto is the SQL engine for Open Data Lakehouse. Moreover, Snowflake data warehouse doesn’t support semi/unstructured data workloads, AI/ML/data science workloads, whereas the data lakehouse does.
While Snowflake did decouple storage and compute, they did not decouple data ownership. . They still own the compute layer as well as the storage layer. This means users must ingest data into Snowflake using a proprietary format, creating yet another copy of data and also requiring users to move their data out of their own environment. Users lose ownership of their data.
On other hand, Presto is a truly disaggregated stack that allows you to run your queries in a federated manner without any need to move your data and create multiple copies. At Ahana, users can define Presto clusters, and orchestrate and manage them in their own AWS account using cross-account roles.
Scalability and Concurrency
With Snowflake you hit a limitation of running maximum concurrent users on a single virtual warehouse. If you have more than eight concurrent users, then you need to initiate another virtual warehouse. Query performance is good for simple queries, however, performance degrades as you apply more complex joins on large datasets and the only options available are limiting the data that you can query with Snowflake or adding more compute. Parallel writes also impact read operations and the recommendation is to have separate virtual warehouses.
Presto is designed from the ground up for fast analytic queries against data sets of any size and has been proven on petabytes of data, and supports 10-50s concurrent queries at a time
Cost of Snowflake
Users think of Snowflake as an easy and low-cost model. However, it gets very expensive and cost-prohibitive to ingest data into Snowflake. Very large amounts of data and enterprise-grade, long-running queries can result in significant costs associated with Snowflake as it requires the addition of more virtual data warehouses which can rapidly escalate costs. Basic performance improvement features like Materialized Views come with additional costs. As Snowflake is not fully decoupled, data is copied and stored into Snowflake’s managed cloud storage layer within Snowflake’s account. Hence, the users end up paying a higher cost to Snowflake than the cloud provider charges, not to mention the costs associated with cold data. Further, security features come at a higher price with a proprietary tag.
Open Source Presto is completely free. Users can run on-prem or in a cloud environment. Presto allows you to leave your data in the lowest cost storage options. You can create a portable query abstraction layer to future-proof your data architecture. Costs are for infrastructure, with no hidden cost for premium features. Data federation with Presto allows users to shrink the size of their data warehouse. By accessing the data where it is, users may cut the expenses of ETL development and maintenance associated with data transfer into a data warehouse. With Presto, you can also leverage storage savings by storing “cold” data in low-cost options like a data lake and “hot” data in a typical relational or non-relational database.
Snowflake vs. Presto: In Summary
Snowflake is a well-known cloud data warehouse, but sometimes users need more than that –
- Immediate data access as soon as it is written in a federated manner
- Eliminate lag associated with ETL migration when you can directly query from the source
- Flexible environment to run unstructured/ semi-structured or machine learning workloads
- Support for open file formats and storage standards to build open data lakehouse
- Open-source technologies to avoid vendor lock-in
- The cost-effective solution that is optimized for high concurrency and scalability.
Presto can solve all these user needs in a more flexible, open-source, secure, scalable, secure, and cost-effective way.
SaaS for Presto
If you want to use Presto, we’ve made it easy to get started in AWS. Ahana is a SaaS for Presto. With Ahana for Presto, you can run in containers on Amazon EKS making the service highly scalable & available. We have optimized Presto clusters with scale up and down compute as necessary which helps companies achieve cost control. With Ahana Cloud, you can easily integrate Presto with Apache Ranger or AWS Lake Formation and address your fine-grained access control needs. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and what data access and security policies you want to apply.
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.
Here, we are going to talk about AWS Athena vs Glue, which is an interesting pairing as they are both complementary and competitive. So, what are they exactly?