By Dipti Borkar, Ahana Cofounder, Chief Product Officer & Chief Evangelist
I’ve been in open source software companies and communities for over 10 years now, and in the database industry my whole career. So I’ve seen my fair share of the good, the bad, and the ugly when it comes to open source projects and the communities that surround them. And I’d like to think that all this experience has led me to where I am today – cofounder of Ahana, a company that’s betting big on an open source project: PrestoDB. Let’s first talk about the problem we’re trying to solve.
The Big Shift
Organizations have been using costly, proprietary data warehouses as the workhorse for analytics for many years. And in the last few years, we've seen a shift to cloud data warehouses like Snowflake, AWS Redshift, and BigQuery.
Couple that with the fact that organizations have far more data (from terabytes to tens and hundreds of terabytes, even petabytes) and more varied types of data (telemetry, behavioral, IoT, and event data in addition to enterprise data), and there's even greater urgency for users not to get locked into proprietary formats and proprietary systems.
These shifts, along with AWS commoditizing the storage layer into something ubiquitous and affordable, mean that a lot more data now lives in cloud object stores like AWS S3, Google Cloud Storage, and Azure Blob Storage.
So how do you query the data stored in these data lakes? How do you ask questions of the data and pull the answers into reports and dashboards?
The “SQL on S3” Problem
- Data lakes are only storage – there is no intelligence in the data lake
- If the data is structured – think tables and columns – SQL is hands down the best way to query it; hey, it's survived 50+ years
- If the data is semi-structured – think nested formats like JSON – SQL can still be used to query it, with extensions to the language
- But SQL on data lakes was complicated. Hadoop tried it, and we know that didn't work. While it sort of solved the problem, trying to get 70+ different components and projects to integrate and work together turned out to be a nightmare for data platform teams.
- There was really no simple yet performant way of querying S3.
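To make this concrete, here is a sketch of what SQL on S3 looks like with an engine like Presto in place. The table, columns, and bucket path are hypothetical, and the table properties assume Presto's Hive connector; it's an illustration, not a real deployment:

```sql
-- Hypothetical: a table in the Hive catalog whose files live directly in S3.
CREATE TABLE hive.web.events (
    user_id     varchar,
    event_type  varchar,
    payload     ROW(page varchar, referrer varchar),  -- nested, JSON-style data
    event_date  date
)
WITH (
    external_location = 's3://my-data-lake/events/',  -- illustrative bucket
    format = 'JSON'
);

-- Plain SQL over the lake, with dot access into the nested column
SELECT event_type, payload.page, count(*) AS views
FROM hive.web.events
WHERE event_date = DATE '2021-06-01'
GROUP BY event_type, payload.page
ORDER BY views DESC;
```

The point is that once the engine is there, querying files in S3 – structured or semi-structured – looks like querying any other table.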
That is where Presto comes in.
Presto is the best engine built to directly query open formats and data lakes. Presto replaced Hive at Facebook. Presto is the heart of the modern Open Data Lake Analytics Stack – an analytics stack that includes open source, open formats, open interfaces, and open cloud. (You can read more details about this in my Dataversity article.)
The fact of the matter is that this problem can be solved with many different open source projects/engines. At Ahana, we chose Presto – the engine that runs at Facebook, Uber and Twitter.
1. Crazy good tech
Presto is in-memory, scalable, and built like a database. And with new innovations being added, Presto is only getting bigger and better.
Like I mentioned earlier, I’ve been in open source and the database space for a long time and built multiple database engines – both structured and semi-structured (aka NoSQL). I believe that PrestoDB is the query engine most aligned with the direction the analytics market is headed. At Ahana, we’re betting on PrestoDB and have built a managed service around it that solves the problem for Open Data Lake Analytics – the SQL on S3 problem.
No other open source project has come close. Some that come to mind –
- Apache Drill unfortunately lost its community. It had a great start being based on the Dremel paper (published by Google) but over time didn’t get the support it should have from the vendor behind it and the community fizzled.
- SparkSQL (built on Apache Spark) isn't built like a database engine; instead, it is bolted on top of a general-purpose computation engine.
- Trino, a hard fork of Presto, is largely focused on a different problem of broad access across many different data sources versus the data lake. I fundamentally believe that all data sources are NOT equal and that data lakes will be the most important data source over time, overtaking the data warehouses. Data sources are not equal for a variety of reasons:
- Amount of data stored
- Type of information stored
- Type of analysis that can be supported on it
- Longevity of data stored
- Cost of managing and processing the data
Given that 80-90% of the data lives on S3, I expect 80-90% of analytics to run on this cheapest data source. That data source is the data lake. And the need to perform a correlation across more data sources comes up only 5-10% of the time, and only for a window of time until the data from those sources also gets ingested into the data lake.
As an example: MySQL is the workhorse for operational transactional systems. A complex analytical query with multi-way joins from a federated engine could bring down the operational system. And while there may be a small window of time when the data in MySQL is not available in the data lake, it will eventually be moved over to the lake, and that’s where the bulk of the analysis will happen.
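For that minority of cases, Presto can federate in a single query by joining across catalogs. A minimal sketch – the catalog, schema, and table names here are hypothetical, and it assumes a MySQL connector has been configured alongside the Hive/S3 catalog:

```sql
-- Hypothetical federated query: recent operational data in MySQL
-- joined against historical events in the S3 data lake.
SELECT u.name, sum(e.amount) AS total_spend
FROM mysql.app.users u              -- operational MySQL database
JOIN hive.lake.purchase_events e    -- event history stored in S3
  ON u.id = e.user_id
GROUP BY u.name;
```

Useful when you need it, but in my experience the steady state is that this data lands in the lake and the join happens there.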
2. Vendor-neutral Open Source, not Single-vendor Open Source
On top of being a great SQL query engine, Presto is open source and, most importantly, part of the Linux Foundation – governed with transparency and neutrality. On the other hand, Trino, the hard fork of Presto, is a single-vendor project, with most of the core contributors being employees of that vendor. This is problematic for any company planning to use the project as a core component of its data infrastructure, more so for a vendor like Ahana that needs to be able to support its customers and contribute to and enhance the source code.
Presto is hosted under the Linux Foundation in the Presto Foundation, similar to how Kubernetes is hosted by the Cloud Native Computing Foundation (CNCF) under the Linux Foundation umbrella. Per the bylaws of the Linux Foundation, Presto will always stay open source and vendor neutral. It was a very important consideration for us that we could count on the project remaining free and open forever, given the many examples we have seen of single-vendor projects changing their licenses to be more restrictive over time to meet the vendor's commercialization needs.
For those of you that follow open source, you most likely saw the recent story on Elastic changing its open-source license which created quite a ripple in the community. Without going into all the details, it’s clear that this move prompted a backlash from a good part of the Elastic community and its contributors. This wasn’t the first “open source” company to do this (see MongoDB, Redis, etc.), nor will it be the last. I believe that over time, users will always pick a project driven by a true gold-standard foundation like Linux Foundation or Apache Software Foundation (ASF) over a company-controlled project. Kubernetes over time won the hearts of engineers over alternatives like Apache Mesos and Docker Swarm and now has one of the biggest, most vibrant user communities in the world. I see this happening with Presto.
3. Presto runs in production @ Facebook, runs in production @ Uber & runs in production @ Twitter.
The most innovative companies in the world run Presto in production for interactive SQL workloads. Not Apache Drill, not SparkSQL, not Apache Hive, and not Trino. The data warehouse engine at Facebook is Presto; ditto for Uber, likewise for Twitter.
When a technology is used at the scale these giants run at, you know you are not only getting technology created by the brightest minds but also tested at internet scale.
And that's why I picked Presto – that's PrestoDB, and there's only one Presto.
In summary, I believe that Presto is the de facto standard for SQL analytics on data lakes. It is the heart of the modern open analytics stack and will be the foundation of the next 10 years of open data lake analytics.
Join us on our mission to make PrestoDB the open source, de facto standard query engine for everyone.