Snowflake may not be the silver bullet you wanted for your long-term data strategy… here’s why

Since COVID, businesses have pivoted online, accelerating digital transformation with data and AI. Self-service, accelerated analytics has become increasingly critical, and Snowflake did a great job bringing cloud data warehouses to market at a time when users were struggling with on-prem big data solutions and trying to catch up on their cloud journeys. Snowflake is designed foundationally to take advantage of the cloud’s benefits, and it has enjoyed a first-mover advantage. Still, here are the key areas you should think about as you evaluate a cloud data warehouse like Snowflake.

Open Source and Vendor Lock-in

Using a SQL engine that is open source is strategically important because it allows data to be queried without ingesting it into a proprietary system. Snowflake is not open source software. Only data that has been aggregated and moved into Snowflake, in a proprietary format, is available to its users. Moreover, Snowflake is pushing back on open source in favor of its proprietary solutions. Recently, Snowflake announced the Snowflake Data Cloud, positioning Snowflake as a platform for “Cloud Data” where organizations can move and store all their data.

However, surrendering all your data to the Snowflake data cloud model creates vendor lock-in challenges: 

  1. Excessive cost as you grow your data warehouse
  2. Once ingested, data is locked into the formats of a closed-source system
  3. No way to leverage community innovations or other technologies and services to process that same data

Snowflake doesn’t benefit from the community innovation that true open source projects enjoy. For example, an open source project like Presto receives contributions from engineers across Twitter, Uber, Facebook, Ahana, and more. At Twitter, engineers are working on the Presto-Iceberg connector, aiming to bring high-performance data analytics on an open table format to the Presto ecosystem.

Check out this short session for an overview of how Presto is evolving into the next-generation query engine at Facebook and beyond.

With a proprietary technology like Snowflake, you miss out on community-led contributions that can shape a technology for the benefit of everyone.

Open Format

Snowflake has chosen a micro-partition file format that may be good for performance but is closed source. The Snowflake engine cannot work directly with the most common open formats like Apache Parquet, Apache Avro, Apache ORC, etc. Data can be imported from these open formats into Snowflake’s internal file format, but you miss out on the optimizations these open formats bring to an engine: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering, and partitioning schemes that avoid many small files or a few huge ones.
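To make predicate pushdown with min/max skipping concrete, here is a small, self-contained Python sketch. It is not Snowflake or Parquet internals — the row groups and their statistics are simulated — but it shows why an engine that reads format-level min/max metadata can avoid scanning most of the data:

```python
# Illustrative sketch of min/max (zone-map) skipping, as used by open
# formats like Parquet and ORC. Each "row group" carries min/max
# statistics so a query engine can skip groups that cannot possibly
# match the predicate, without reading any of their rows.

row_groups = [
    {"min": 1,   "max": 99,  "rows": list(range(1, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def scan_with_pushdown(groups, predicate, lo, hi):
    """Read only the row groups whose [min, max] range overlaps [lo, hi]."""
    matched, groups_read = [], 0
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            continue  # skipped entirely -- no I/O for this group
        groups_read += 1
        matched.extend(r for r in g["rows"] if predicate(r))
    return matched, groups_read

# Query: values between 150 and 160 -- only the middle group is read.
values, read = scan_with_pushdown(row_groups, lambda r: 150 <= r <= 160, 150, 160)
print(read)         # 1 of 3 row groups scanned
print(len(values))  # 11 matching rows
```

A proprietary format can implement the same trick internally, but only its own engine benefits; with an open format, any engine that understands the metadata gets the speedup.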

On the other hand, Presto users can run TensorFlow over the same open formats, like Parquet and ORC, so there’s a lot of flexibility in this open data lake architecture. Using open formats gives users the flexibility to pick the right engine for the right job without an expensive migration.

While migrating from a legacy data warehouse platform to Snowflake may offer less friction for cloud adoption, folding open source formats into a single proprietary solution may not be as simple as advertised.

Check out this session on how you can combine Apache Parquet, Apache Hudi, and PrestoDB to build an open data lake.

Federated Queries

A SQL engine needs to reach both the data lake, where raw data resides, and the broad range of other data sources, so that an organization can mix and match data from any source. If your data resides in relational databases, NoSQL databases, cloud storage, file systems like HDFS, etc., then Snowflake is not suitable for your self-service data lake strategy: you cannot run SQL queries across data stored in relational, non-relational, object, and custom data sources using Snowflake.
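Conceptually, a federated engine like Presto does what this toy Python sketch does: pulls rows from heterogeneous sources and joins them in a single query. Here a SQLite in-memory database stands in for a relational source and a plain Python list stands in for a document store; the table names and data are invented for illustration:

```python
# Toy illustration of a federated query: join data living in a
# relational source with data living in a "NoSQL" source, the way a
# federated engine joins across catalogs in one SQL statement.
import sqlite3

# Relational source: an orders table in SQLite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(1, 25.0), (2, 40.0), (1, 10.0)])

# Document-store source: user profiles as JSON-like records.
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "grace"}]

# Federated join: total spend per user name, across both sources.
totals = {}
for user_id, amount in db.execute("SELECT user_id, amount FROM orders"):
    totals[user_id] = totals.get(user_id, 0.0) + amount

report = {u["name"]: totals.get(u["user_id"], 0.0) for u in users}
print(report)  # {'ada': 35.0, 'grace': 40.0}
```

With a federated SQL engine, the join above is a single query against two catalogs; without one, you are writing glue code like this (or copying everything into one proprietary store first).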

Workload Flexibility

Today, users want to create new applications as fast as their data is growing, and a single database cannot support such a broad range of analytical use cases. One common workload is training and serving machine learning models directly over warehouse tables, or streaming analytics. Snowflake focuses on the traditional data warehouse as a managed cloud service and requires proprietary connectors to address these ML/DS workloads, which introduces data lineage challenges.

If you have a lot of unstructured data like text or images, volumes beyond petabytes, or schema-on-read as a must-have feature, then Snowflake does not fit your data lake architecture.
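For readers less familiar with schema-on-read, here is a minimal Python sketch of the idea: raw semi-structured records land in the lake as-is, and a schema is applied only at query time, so different queries can project different fields and tolerate missing ones. The event records and field names are invented for illustration:

```python
# Schema-on-read sketch: store raw JSON lines unchanged, apply a
# schema (field names plus defaults) only when reading, not when
# writing -- the opposite of a warehouse's schema-on-write model.
import json

# Raw JSON lines, as they might sit in object storage. Note that the
# records do not share a fixed schema.
raw = """\
{"event": "click", "user": "ada", "ts": 1}
{"event": "view", "user": "grace", "ts": 2, "page": "/pricing"}
{"event": "click", "ts": 3}
"""

def read_with_schema(lines, schema):
    """Project each raw record through a (field, default) schema at read time."""
    for line in lines.splitlines():
        rec = json.loads(line)
        yield {field: rec.get(field, default) for field, default in schema}

# One query's schema: event + user, defaulting missing users to "unknown".
schema = [("event", None), ("user", "unknown")]
rows = list(read_with_schema(raw, schema))
print(rows[2])  # {'event': 'click', 'user': 'unknown'}
```

A schema-on-write system would have rejected or reshaped the third record at ingest time; schema-on-read keeps the raw data intact and lets each consumer decide how to interpret it.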

The new generation of open platforms that unify the data warehouse and advanced analytics is something Snowflake is not fundamentally designed for; Snowflake is suitable only for data warehouse use cases.

Data Ownership

Snowflake did decouple storage from compute. However, it does not decouple data ownership: it still owns both the compute layer and the storage layer. Users must ingest data into Snowflake in a proprietary format, creating yet another copy of the data and moving it out of their own environment. Users lose ownership of their data.


Users think of Snowflake as an easy, low-cost model. However, it gets very expensive to ingest data into Snowflake, and very large data volumes and enterprise-grade long-running queries can drive significant costs.

Because Snowflake is not fully decoupled, data is copied into Snowflake’s managed cloud storage layer within Snowflake’s account. Users therefore end up paying Snowflake more than the cloud provider would charge directly, not to mention the costs associated with cold data. Further, security features come at a premium under the proprietary price tag.


Snowflake may sound appealing for how simple it makes implementing a data warehouse in the cloud. However, an open data lake analytics strategy will augment the data warehouse in the places it falls short, as discussed above, providing significant long-term strategic benefits to users.

With PrestoDB as the SQL engine for open data lake analytics, you can execute SQL queries with high performance, similar to an EDW. Using Presto for your data lake analytics means you don’t have to worry about vendor lock-in, and you get the benefits of open source innovations like RaptorX, Project Aria, Apache Ranger integration, etc. Check out this short tutorial on how to query a data lake with Presto.

While powerful, Presto can be complex and resource-intensive to deploy and manage. That’s where Ahana comes in. Ahana Cloud is the easiest managed service for PrestoDB in the cloud, simplifying open source operational challenges while supporting top innovations in the Presto community.

As open data lake analytics evolves, we have an ambitious roadmap ahead. You can get started with Ahana Cloud today.