Webinar On-Demand
Data Warehouse or Data Lake, which one do I use?

Today’s data-driven companies have a choice to make – where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). 

There are pros and cons to each approach. Data warehouses give you strong data management with analytics, but they don’t do well with semi-structured and unstructured data, they tightly couple storage and compute, and they come with expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization.

In this webinar, you’ll hear from industry analyst John Santaferraro and Ahana cofounder and CPO Dipti Borkar who will discuss the data landscape and how many companies are thinking about their data warehouse/data lake strategy. They’ll share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lake.


Webinar Transcript

SPEAKERS

John Santaferraro | Industry Analyst, Dipti Borkar | CPO & Co-Founder Ahana, Ali LeClerc | Moderator, Ahana

Ali LeClerc | Ahana 

Hi, everybody, welcome to today’s webinar, Data Warehouse or Data Lake, which one do I use? My name is Ali, and I will be moderating the event today. Before we get started and I introduce our wonderful speakers, just a few housekeeping items. One, this session is being recorded. If you miss any parts of it, if you join late, you’ll get a link to both the recording and the slides that we are going through today. Second, we have allotted some time for questions at the end. So please feel free to pop in your questions; there is a questions tab in your GoToWebinar panel. You can go ahead and ask away during the session itself, and we’ll get to them at the end.

So, without further ado, I am pleased to introduce our two speakers. We have John Santaferraro and Dipti Borkar. John is an industry analyst who has been doing this for over 26 years and has a ton of experience in the space. So, looking forward to hearing his perspective. And then we’re also joined by Dipti. Dipti Borkar is the co-founder and CPO of Ahana and has a ton of experience in relational and non-relational database engines as well as the analytics market.

Today they’re going to be talking about data warehouse or data lake. So, with that, John, I will throw things over to you, please take it away.

John Santaferraro | Industry Analyst 

Awesome. Really excited to be here. This topic is top of mind, I think for everyone, we’re going to take a little look at history, you know, where did data lakes and data warehouses start? How have they been modernized? What’s going on in this world where they seem to be merging together? And then give you some guidance on what are use cases for these modern platforms? How do you know, how do you choose a data lake or a data warehouse? Which one do you choose? So, we’re going to provide some criteria for that, we’re going to look at the Uber technical case study and answer any questions that you guys have.

So, I’m going to jump right in. I actually got into the data warehousing world all the way back in 1995. I co-founded a data warehouse startup company, and eventually that sold to Teradata. Right now, I’m thinking back to those times. And really, for the whole decade after that, the traditional data warehouse was a relational database, typically with a columnar structure, although some of the original data warehouses didn’t have that. They had in-database analytics for performance and focused really only on structured data. The data had to be modeled. And data modeling, for a lot of folks, was an endless task. And the whole ETL process was 70% of every project: extracting from all your source systems, transforming it, loading it into the data warehouse. There was primarily SQL access, and these data warehouses tended to have a few sources and one or two outputs, but they were expensive, slow, and difficult to manage. They provided access to limited data. So, there were a lot of challenges, a lot of benefit as well, but a lot of challenges with the traditional data warehouses. So, the data lakes came along and initially Hadoop, you’ll remember this, was going to replace the data warehouse, right?

I remember reading articles about how the data warehouse is dead. This was the introduction of Hadoop with its file system data storage. Suddenly, it was inexpensive to load data into the data lake. So, all data went in there, including semi-structured data and unstructured data. It was all about ingestion of data motored in [inaudible] the structure, once it had been loaded. Don’t throw anything out. Primary use cases were for discovery, text analytics, data science. Although there was some SQL access, initially, notebooks and Python and other languages became the primary way to access. These data lakes were less expensive, but there was limited performance on certain kinds of complex analytics. Most of the analytics folks focused on unstructured data. There was limited SQL access, and they tended to be difficult to govern. Hadoop initially didn’t have all of the enterprise capabilities.

You know, Dipti, you were around through a lot of that. What are some of your memories about data lakes when they first showed up on the scene?

Dipti Borkar | Ahana 

Yeah. It’s great to do this with you, John, we’ve been at this for a while. I started my career in traditional databases as well, DB2 distributed, core storage and indexing kernel engineering. And we saw this entire movement of Hadoop. What it helped with is, in some ways, the separation of storage and compute. For the first time, it really separated the two layers, where storage was HDFS, and then compute went through many generations. Even just in the Hadoop timeframe there was MapReduce, Hive, variations of Hive and so on. But what happened is, you know, I feel like the companies, the leaders that were driving Hadoop, never really simplified it for the platform teams.

Technology is great to build, but if it’s complicated, and if it takes a long time to get value from, no matter how exciting it is, it doesn’t serve its purpose. And that was the biggest challenge with Hadoop: there were 70 different projects that took six to nine months to integrate and to see real value or insights from the data in HDFS, and many of these projects didn’t go well. And that’s why, over time, people struggled with it. We’ll talk a little bit about cloud and how the cloud migration is playing such a big role in the modernization of these data lakes. So just some perspectives there.

John Santaferraro | Industry Analyst 

Yeah, you know, you just reminded me as well, the other positive thing is that I think that we had seen open source as an operating system. And with the introduction of Hadoop, there was a massive uptake of acceptance and adoption around open source technology as well. So that was another real positive during that time. So, what we’ve seen since the inception of the data warehouse, and you know, the incursion of Hadoop into the marketplace and the data lake, is a very rapid modernization of those platforms driven by three things.

One is digital transformation. Everything has moved to digital now, especially with the massive uptake of mobile technology. Internet technology is way more interactive and engaging than when it used to be informational, and there are tons more data and data types. Along with that, there is an increasing need for engagement with customers and real-time engagement with employees, engagement with partners. Everything is moving closer and closer to either just in time or real time. And so that’s created the need to be able to respond quickly to business events of any kind. And I think third, we’re really on the cusp of seeing everything automated.

Obviously, in the world of robotics, there are massive manufacturing plants where everything is now automated. And that’s being transferred over to this world of robotic process automation. Automating everything requires incredible intelligence delivered to machines, and sensors, and, you know, every kind of device that you can imagine on the Internet of Things. And so, these trends have really pushed us to the modernization of both the data warehouse and the data lake.

And interestingly enough, you can look at the slide that I popped up, but modernization is happening in all of these different areas, in both the data warehouse and the data lake. The most modern of both are cloud first. There was a move to in-memory capabilities. On the data warehouse side, they’re now bringing in more complex data types that were typically only handled on the data lake, and the modern data lake is bringing in columnar data types with great performance. Now both have the separation of compute and storage. So, you can read the rest of them here. The interesting thing about the modernization movement is that both the data warehouse and the data lake are being modernized.

What trends are you seeing in modernization, Dipti? I kind of tend to approach this at a high level, looking at capabilities. I know you see the technology underneath and go deep in that direction. What’s your take on this?

Dipti Borkar | Ahana 

Yeah, absolutely. I mean, cloud first is really important. There are a lot of companies that are increasingly just in the cloud. Many are born in the cloud, like Ahana, but also their entire infrastructure is in the cloud. It could be multiple clouds, it could be a single cloud. That’s one of the aspects. The other aspect, within clouds, is containerization, a very, very big trend. A few years ago, Kubernetes wasn’t as stable. And so now today, the way we’ve built Ahana is cloud first, and it runs completely on Kubernetes. And it’s completely containerized, to help with the flexibility of the cloud and the availability, the scalability, and leveraging some of those aspects. I think the other aspect is open formats.

Open formats are starting to play a big role. With the data lake, and I call it open data lakes, for a variety of reasons. Open formats is a big part of it. Formats like Apache ORC, Apache Parquet, they can be consumed by many different engines. In one way, you’re not locked into a specific technology, you can actually move from one engine to another, because many of them support it. Spark supports it, Presto supports it, TensorFlow just added some support as well. With a data lake, you can have open formats, which are highly performant, and have multiple types of processing on top of it. So, these are some of the trends that I’m seeing, broadly, on the data lake side.

And, of course, the data warehouses are trying to expand and extend themselves to the data lake as well. But what happens is, when you have a core path, a critical path for any product, it’s built for a specific format, or a specific type of data. We’ve seen that with data warehouses, most of the time it’s proprietary formats. And S3, and these cloud formats, might be an extension. And for data lakes, the data lake engines, are actually built for the open formats, and not for some of these proprietary formats.

These are some of the decisions and considerations that users need to think about in terms of what’s important for them. What kind of information do they want to store – over time, historical data – in their data lake or data warehouse? And how open do they want it to be?

John Santaferraro | Industry Analyst 

Yeah, I think you bring up a good point too, in that the modernization of the data lake has really opened up the opportunity for storage options. And specifically, lower cost storage options and storage tiering. So that in that environment, customers can choose where they want to store their data. If they need high performance analytics, then it goes in optimized storage of some kind. If what they need is a massive amount of data, then they can still store it in file systems. But the object storage, simple storage options are much less costly, and I think we’re actually moving towards a world where, at some point in the future, companies will be able to store their data inexpensively, in one place, in one format, and use it an endless number of times.

I think that’s the direction that things are going as we look at modernization.

Dipti Borkar | Ahana 

Absolutely. Look at S3, the most widely used cloud store. It’s 15 years in the making. So, there are trillions of objects in S3. And now that it’s ubiquitous, and it’s so cheap, most of the data is landing there. So that’s the first place the data lands. And users are thinking about, okay, once it lands there, what can I do with it? Do I have to move it into another system for analysis? And that might be the case. As you said, there will be extremely low latency requirements, in some cases, where it might need to be in a warehouse.

Or it might need to be – of course, you know, operational systems will always be there – here we’re talking about analytics. And what other processing can I run directly on top of S3 and on top of these objects, without moving the data around? So that I get the cost benefits, which AWS has driven through. It’s very cheap to store data now, and so can I have compute on top? Essentially to do structured analysis, semi-structured analysis, or even unstructured analysis with machine learning, deep learning and so on.

S3 and the cloud migration, I would say, have played a massive role in the adoption of data lakes, and the move towards the modern data lake that you have here.

John Santaferraro | Industry Analyst 

So, you at Ahana, you guys talk about this move from data to insight and the idea of the SQL query engine. Do you want to walk us through this Dipti?

Dipti Borkar | Ahana 

Yeah, absolutely. I touched on some of the different types of processing that’s possible on top of data lakes. One of those workloads is the SQL workload. Data warehouses and data lakes sit next to each other. In the data warehouse, obviously you have your storage and your compute. And typically, these are in the tens of terabytes. Most data warehouses tend to be along that dimension of scale. But the amount of data has increased, and the types of information have increased, and some of them are contributing a lot more data: IoT data, device data, third-party data, behavioral data. It used to be just enterprise data, orders and line items. When you look at a benchmark like TPC-DS, it is very, very simple. It’s enterprise data. But now we have a lot more data. And that is leading to storage, and all of this information, going into the data lake. And the terabytes are now becoming petabytes, even for small companies. So that’s where the cost factor becomes very, very important.

Lower costs are what users are looking for, and infrastructure workloads on top of that. Presto has come up as one of the really great engines for SQL processing on top of data lakes. It can also query other data sources like MySQL RDS, and so on. But Presto came out of Facebook as a replacement for Hive, which was essentially built for the data lake. And so, reporting and dashboarding are great use cases on top of Presto, interactive use cases, I would say. There’s also an ad hoc querying use case that’s increasing. Most often, we’re seeing this with SQL notebooks – with Jupyter or Zeppelin and others – and then there are also data transformation workloads that run on top of the data lakes. Presto is good for that. But there are other engines, like Spark, for example, that actually do a great job.

They’re built for ETL, or in-data-lake transformation, and they play a big role in these workloads that run on top of the data lake. So, what we’re seeing as we talk to users, data platform teams, data platform engineers, is that there are a couple of paths. Some are pre-warehouse, and I call them kind of pre-warehouse users, where they’re not on a data warehouse yet. They’re still perhaps running Tableau or Looker, or MySQL or Postgres. They now have a choice, for the first time, where they can actually run some of these workloads on a data lake for slightly lower costs, because data warehouses could be cost prohibitive. Or another approach is to augment the data warehouse. And so you start off with a data warehouse, you might have some critical data in there where you need very low latencies. It is a tightly coupled system. And so you’re going to get good performance. If you don’t need extreme performance, if you don’t have that as a criterion, the data lake option becomes a very real option today, because the complexity of Hadoop has sort of disappeared.

And there are now much simpler solutions that exist from a transformation perspective, like managed services for Spark, as well as, from an interactive querying and ad hoc [inaudible] perspective, managed services for Presto. That’s what we’re seeing. In some cases, users may skip the warehouse. In other cases, they may augment it and have some data in a warehouse and some data in a data lake. Thoughts on that, John?

John Santaferraro | Industry Analyst 

Yeah, I mean, just to confirm, I think the diagram that you are presenting here shows Presto kind of above the cloud data lake. But there could be another version of this. If somebody has a data warehouse, and they don’t want to rip and replace and go to an open source data warehouse, Presto sits above both the data lake and the traditional data warehouse. So it can unify access for those same tools above it: SQL access for the data lake, the open source data warehouse, and the columnar data warehouse. Isn’t that correct?

Dipti Borkar | Ahana 

Absolutely. And I think we’re at a point where we just have to accept that there is a proliferation of databases and data sources, and there will be many. There is a time element where, you know, not all the data may be in the data lake. And so for those use cases, federation, or querying across data sources, is how you can correlate data across different data sources. So if the data is in, not just a data warehouse, but let’s say Elasticsearch, where you’re doing some analytics, or a data warehouse, and it has not landed in the data lake yet, and the pipelines are still running, you can have a query that runs across both, and have unified access to your data lake as well as your database, your data warehouse, or semi-structured system of record as well.
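
To make the federation idea above concrete, here is a minimal sketch of one SQL query joining two separate data sources. It is an analogy only: SQLite's `ATTACH DATABASE` stands in for a distributed query engine so the example runs anywhere, and the source names (`lake`, `warehouse`), tables, and rows are all hypothetical, not anything shown in the webinar.

```python
import sqlite3

# One connection plays the role of the query engine sitting above the sources.
# SQLite is used only so the sketch is self-contained; it is not Presto.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")       # hypothetical data lake source
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")  # hypothetical warehouse source

# Each source holds its own table; no data is copied between them.
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, event TEXT)")
conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO lake.events VALUES (?, ?)",
    [(1, "click"), (1, "purchase"), (2, "click")],
)
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# A single query correlates data across both sources.
federated_sql = """
    SELECT c.name, COUNT(*) AS event_count
    FROM lake.events AS e
    JOIN warehouse.customers AS c ON c.customer_id = e.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""
rows = conn.execute(federated_sql).fetchall()
print(rows)  # [('Acme', 2), ('Globex', 1)]
```

In Presto the same pattern is written with fully qualified `catalog.schema.table` names, where each catalog is configured to point at a different source.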

John Santaferraro | Industry Analyst 

Awesome. Yes. So one of the things that I want to do for you as an audience, is help you understand what kinds of use cases are best for which platform. The modern data lake and the modern data warehouse. Because it is not one size fits all. And so what you see here is actually very similar things on both sides, very similar use cases. But I tried to rank them in order. And at this point, this is not researched. I do research as well. But this happens to be based on my expertise in this area. And, Dipti, feel free to jump in and agree or disagree. Guess what, hey, we’re on a webinar, but we can disagree, right? So on the data lake side, again, one through eight, high performance data intensive kinds of workloads.

We’re going to talk about a story where there are hundreds of petabytes of information, and looking at exabyte scale probably in the next year or so. That is definitely going to happen on the modern data lake, not on the modern data warehouse. On the other side, the modern data warehouse: super high compute intensive kinds of workloads with complex analytics and many joins may still work better on the modern data warehouse. Both have lower cost storage. Again, back to the data lake side: massive scale, well suited to many different kinds of data types – structured and unstructured – and diversity of the kinds of analytics that you want to run.

And then as you get down toward the bottom, you know, things like high concurrency of analytics, you can see it’s up higher on the right-hand side, where, again, with a columnar database, you may be able to support higher levels of concurrency. Now, all of this is moving, by the way, because the modern data lakes are working on “how do I drive up concurrency?” They know they’ve got to do that. I would say that because databases have been around a little bit longer, some of the modern data warehouses have more built-in enterprise capabilities. Things like governance and other capabilities. But guess what? All of that is rising on the modern data lake side.

So, from my perspective, this is my best guess based on 26 years of experience in this industry. All of this is a moving target because things are constantly changing. Dipti, jump in, and you certainly don’t have to agree with me. Think of this as a straw man. What’s your take on use cases for these two worlds? Modern data lakes and modern data warehouses?

Dipti Borkar | Ahana 

Yeah, absolutely. I’m trying to figure out where I disagree, John. But in terms of the criteria, these are some of the criteria that our users come with and say, “Look, we are looking at a modern platform for analytics. We have certain criteria, we want to future proof it.” Future proofing is becoming important, because these are important decisions that you make for your data platform. You don’t change your data platform every other day.

A lot of these decisions are thought through very carefully, the criteria are weighed in. And there are different tools for different sets of criteria. In terms of data lakes, I would say that the cost aspects and the scale aspects are probably the driving factors for the adoption of data lakes. High performance, I think, tends to be more data intensive, you’re right there. You can also run, obviously, a lot of high complexity queries as well on data lakes. Take Presto as an example of a query engine: you can still run fairly complicated queries.

However, to your point, John, there is a lot of state of the art in the database world, 50 years of research, on complex joins, optimizer optimizations in general, that we are actually working on to make the data lake stronger, to get it on par with the data warehouse. Depending on the kind of queries that are run, what we’re seeing is that simple queries, you know, with predicate pushdown, simple predicates, etc., run really great on the lake. There might be areas where the optimizer may not be as capable of figuring out what is the right way to reorder joins, for example, where there’s work that’s going on. So I think that most of these are in line with what we’re seeing from a user perspective. The other thing that I would add is the open aspect of it. Most of the data lake technologies have emerged from internet companies. And the best part is that they open sourced it. So that has benefited all the users that now have the ability to run Presto or Spark or other things.
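
As a rough illustration of why pushing a simple predicate down toward the scan helps, the sketch below compares applying a filter after a join with applying it before. Everything here is an assumption made for illustration: the tables are hypothetical in-memory lists, and the hash join is a plain Python dictionary lookup, not how any particular engine implements it.

```python
# Hypothetical in-memory tables standing in for files on a data lake.
orders = [{"order_id": i, "customer_id": i % 100} for i in range(10_000)]
customers = [{"customer_id": c, "region": "EU" if c < 10 else "US"} for c in range(100)]


def join_then_filter():
    """Join every order to its customer, then apply the region predicate."""
    by_id = {c["customer_id"]: c for c in customers}
    joined_rows = 0
    result = []
    for order in orders:
        customer = by_id[order["customer_id"]]
        joined_rows += 1  # every order produces an intermediate joined row
        if customer["region"] == "EU":
            result.append((order["order_id"], customer["region"]))
    return result, joined_rows


def filter_then_join():
    """Apply the region predicate while scanning customers, then join."""
    eu_by_id = {c["customer_id"]: c for c in customers if c["region"] == "EU"}
    joined_rows = 0
    result = []
    for order in orders:
        customer = eu_by_id.get(order["customer_id"])
        if customer is not None:
            joined_rows += 1  # only matching orders produce a joined row
            result.append((order["order_id"], customer["region"]))
    return result, joined_rows


naive_result, naive_joined = join_then_filter()
pushed_result, pushed_joined = filter_then_join()
print(naive_result == pushed_result)  # True: same answer either way
print(naive_joined, pushed_joined)    # 10000 1000
```

Both plans return the same rows, but the pushed-down version materializes a tenth of the intermediate join output in this toy data set. Join reordering is the harder problem mentioned above, since the optimizer has to estimate which order keeps those intermediate results small.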

But from a warehouse perspective, it’s still very closed. There isn’t actually a good open source data warehouse. And as platform teams get more mature and get more skilled, they are looking at ways to interact and contribute back and say, “Hey, you know, this feature doesn’t exist yet. Do I wait for a vendor to build it three years from now? Or do I have the ability to contribute back?” And that’s where the open source aspect that you brought up earlier starts to play a bigger role, which is not on this list, but it’s also starting to be a big part of decision making as users and platform teams look at data lakes. They want the ability to contribute back, or at least not get locked in, perhaps, to some extent, and have multiple vendors or multiple people and organizations working on it together so that the technology improves. They have options and they can keep their options open.

John Santaferraro | Industry Analyst 

Yeah, yeah, great, great input. You know, the other trend that I’m seeing, Dipti, is the merging of the cloud data warehouse and cloud data lake, those two worlds coming together. And I think that’s driven largely by customer demands. I think that there are still a lot of companies that are running a data warehouse, and they have a data lake. As we’ve talked about the modernization of both of those, and even similarities now between them that weren’t there 10 years ago, there is a merging of the cloud data warehouses.

Customers don’t want to have to manage two different platforms, with two different sets of resources, two different sets of skill sets. It’s too much, and so they want to move from two platforms to one, from two resource types to one, from self-managed to fully managed, from complex queries and joins, trying to understand intelligence that requires both the data lake and the data warehouse, to a simple way to be able to ask questions of both at the same time. And as a result of that, from disparate to connected intelligence, where I don’t have a separate set of intelligence that I get out of my data warehouse and a separate set that comes out of the data lake. I have all of my data and I can amplify my insight by being able to run queries across both of those, or in a single platform that is able to do the work of what used to be done on the two platforms.

I’m seeing this happen from three different directions. One of them is that traditional data warehouse companies are trying to bring in more complex data types, and provide support for discovery kinds of workloads and data science. On the data lake side, great progress has been made with what you [inaudible] the Open Data Warehouse, where you can now analyze ORC and Parquet files, columnar files, in the same way that you would analyze things on a columnar database. So there are those two. And then the third, which, go Ahana, is this idea of why not? Why not take SQL, the lingua franca of all analytics, the most common language of all analytics still on the planet today, where there are the most resources possible, and be able to run distributed queries across both data lakes and data warehouses and bring the two worlds together? I think this is the direction that things are going, and Dipti, this is where – kudos to Ahana for, you know, really commercializing and providing support for and bringing into the cloud all of the capabilities of Presto.

This is not the Ahana version of why I think this is a good idea. This is the John version. SQL access means you leverage this vast number of resources: every company in the world, on both the technical and the business side, has people who understand and write SQL. Better insight, because you’re now looking at data in the data lake and in the data warehouse. Unified analytics, which means you can support more of your business use cases with a distributed query engine. Distributed query engines mean that you get to leverage your existing investment in platforms, with limitless scale and for all data types. So this is my version of the capabilities.

Any thoughts you have on this, Dipti?

Dipti Borkar | Ahana 

Yeah, absolutely. I think that these two spaces are converging, right? There’s a big convergence that’s happening. The way I see it, from an architecture and technology perspective, is: which one do you want to bet on for the future? Where is the bulk of your data? What is your primary path that you want to optimize for? The reason that’s important is that it will tell you where most of your data lives. Is it 80% in the warehouse? Is it 80% in the lake? And that’s an important decision. And this is obviously driven by the requirements, the business requirements that you have. What we’re seeing is that, you know, for some reports or dashboards where you need really very, very high-performance access, the data warehouse would be a good fit.

But there is an emerging trend of different kinds of analysis, some of which we don’t even know yet. And having that in a lake, and consolidating in a lake, gives you the ability to run these future-proof kinds of engines, platforms, whatever tools come out, on the lake. Because the technologies that are being built – a lot more innovation is kind of happening on the lake side of it, because of the cost profile of S3, GCS and others. That becomes the fundamental decision.

The next part, and the good part, is that even if you choose one way or the other, and I will have a bias towards the lake, because I think that’s where – I was on the warehouse, I’ve spent many years of my life on the warehouse – but from a future perspective, the next 10 years of analytics, I see that on the data lake. Whichever one you pick, the good part is you do have a layer on top that can abstract that and can give you access across both. And so you now have the ability, which didn’t exist before, to actually query across multiple data sources.

Typically, we’re seeing that it’s the data lake: most people have a data lake, and then they want to query maybe one or two other sources. That’s the use case that we’re seeing. In addition, the cloud, you know, you talked about cloud and full service, is becoming a big, big criterion for users, because installing, tuning, then kind of ingesting data, running performance benchmarks, tuning some more, that phase of three to six to nine months of your blog, running POCs, is not helping anyone.

Frankly, it doesn’t help the vendors either, because we want to create value for customers as soon as possible, right? And so with these managed services, what we’ve done with Ahana is we’ve taken a three- or six-month process of installing and tuning down to a 30-minute process, where you can actually run SQL on S3 and get started in 30 minutes. This is in your environment, on your S3, using your catalog. It might be AWS Glue, or it might be a Hive Metastore. And that is progress from where we were. And so the data platform team can create value for their analysts, the data scientists, the data engineers a lot sooner than with some of these other installed products.

So I see it as a few different dimensions, figure out your requirements, and then try to understand how much time you want to spend on the operational aspects of it. Increasingly, fully managed services are being picked because of the lower operational costs, and the faster time to insight from a data perspective.

John Santaferraro | Industry Analyst 

Great. So the other thing I want to leave you with as an audience is some considerations for any unified analytics decision. There are eight areas here to drill down into, I’m not going to go deep into these, but I want to provide this for you. So you can be thinking about eight areas of consideration as you’re choosing a unified analytics solution.

From a data perspective, what is the breadth of data that can be covered by this particular approach, in terms of unified analytics, moving forward? Look at support for a broad range of different types of analytics, not just SQL, but Python, notebooks, search, anything that enhances your analytic capabilities and broadens them. You want to make sure that your solution supports a broad set of users on a single platform: everybody from the engineer to the business, and the analyst and the scientist in between. It’s got to be cloud. In my opinion, cloud is the future. Does the platform support enterprise requirements, all of the business requirements? Is it cost efficient from a business perspective? And then drilling down into the cloud, looking at things like elasticity, which is automation, scalability, and mobility, because everything’s going mobile and [inaudible]. Am I able to do this as I expand to new regions?

In terms of drilling down on the enterprise, look at security, privacy, and governance. Is unification for the business there? Does it support business semantics for my organization, and the logic that I want to include in it, either in the product or in a layer above? In some cases, it’s going to be through partners. Is it going to allow me to create measurable value for my organization and optimize, creating more value over time? And then finally, in terms of costs, is it going to allow me to forecast my costs accurately and contain costs over time? Look at things like chargeback and cost at scale. For anybody that’s doing analytics, that analytics program is growing.

So it’s got to be able to scale without just multiplying and creating incremental costs as you grow.

Dipti Borkar | Ahana 

One more thing I would add to costs, John, is the starting cost: the initial cost to even try it out and get started. This is important, because even the way the platform teams evaluate products and technologies is changing. They will want the ability to have a pay-as-you-go model. We’re seeing that be quite useful for them, because sometimes you don’t know until you’ve tried it out for a period of time.

What cloud is enabling is also a pay-as-you-go model. Initially you only pay for what you use; it’s a consumption-based model. It might be compute hours, it might be storage, and different vendors might do it different ways, but that is important. Make sure you have that option, because it will give you the flexibility to try things out in parallel. You don’t have to have an exorbitant starting cost for trying out a technology. And the cloud is now allowing you to actually have that option available.
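As a rough illustration of the consumption-based model described here, the following sketch estimates a monthly bill from compute hours and storage. The rates, node counts, and the `UsageRates` and `monthly_cost` names are all hypothetical and are not any vendor's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRates:
    """Hypothetical consumption-based prices; real vendor pricing will differ."""
    per_compute_hour: float      # price for one node running for one hour
    per_tb_month_storage: float  # price for storing one terabyte for a month


def monthly_cost(rates: UsageRates, nodes: int, hours_per_day: float,
                 days: int, stored_tb: float) -> float:
    """Estimate one month's bill when you pay only for what you use."""
    compute = rates.per_compute_hour * nodes * hours_per_day * days
    storage = rates.per_tb_month_storage * stored_tb
    return compute + storage


# Illustrative numbers only: a small trial cluster used during working hours,
# compared with the same cluster left running around the clock.
rates = UsageRates(per_compute_hour=0.50, per_tb_month_storage=23.0)
trial = monthly_cost(rates, nodes=3, hours_per_day=8, days=20, stored_tb=2)
always_on = monthly_cost(rates, nodes=3, hours_per_day=24, days=30, stored_tb=2)
print(f"trial: ${trial:.2f}, always-on: ${always_on:.2f}")
```

Under these assumed numbers, the gap between the two figures is the starting cost you avoid by paying only for the hours you actually use during an evaluation.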

John Santaferraro | Industry Analyst 

Yeah, good point, Dipti. So I had the privilege of interviewing Uber, both a user and a developer of Presto, and what an incredible story. I was blown away. First of all, the hyperscale of analytics. Analytics is core to everything that Uber does. The hyperscale was amazing – 10,000 cities – and I’m just going to say it all, even though it’s right there in front of you to read, because it’s amazing. 18+ million trips every single day. They now have 256 petabytes of data, and they’re adding 35 petabytes of new data every day. They’re going to go to exabytes. They have 12,000 monthly active users of analytics running more than 400,000 queries every single day. And all of that is running on Presto. They have all the enterprise readiness capabilities, from automation and workload management to running complex queries and security. It’s an amazing story. Dipti, I mean, you know this story well. What stands out to you about Uber, and not just their use of Presto, but their development of it as well?

Dipti Borkar | Ahana 

Yeah, absolutely. I mean, it’s an incredible story. And there are many, many other incredible stories like this, where Presto is being used at scale. If we refer back to your chart earlier, where we looked at scale and where a data lake fits in and where a data warehouse fits in, you probably would not be able to do this with a data warehouse. In fact, they migrated off a data warehouse. It was, I think, Vertica or something like that, to Presto. They’ve completed that migration. And not just that, they have other databases that sit next to Presto that Presto also queries.

So, you know, this is as perfect a use case as there is for the unified analytics slide that you presented earlier. Because not only is it running on a data lake, with petabytes and petabytes of information, but it’s also actually abstracting and unifying across a couple of different systems. And Presto is being used for both. It is the de facto query engine for the lake, and it helps in some cases where you need to do a join or correlation across a couple of different databases. The other thing I’d add here is that not everybody is at Uber scale.

How many internet companies are there? But what we’re seeing is that users and platform teams throw away a lot of data, and don’t store it, because of the cost implications of warehouses. The traditional warehouses, and also the cloud warehouses, may have double the cost, because you have the data in your lake, but you also have to ingest it into another warehouse. So you’re duplicating the storage cost, and you’re paying quite a bit more for your warehouse. Instead of throwing away the data because it’s cost prohibitive, that’s where the lake helps. Store it in S3; you don’t have to throw compute at it today.

But tomorrow, let’s say that data starts to become more interesting. You can very easily convert it to Parquet or another format (Presto can query JSON and many different formats), query it with Presto on top from an analytics perspective, and correlate it with other data that you have in S3. So I would say, instead of aggregating data and losing data, remember that data is an asset, and most businesses are thinking about it in that way. It isn’t on your balance sheet yet, but there will be a time when you actually weigh the importance of the data you have.

If you have the ability to actually store all this data now, because it is cheap, you can use Glacier storage or S3. AWS has really great [inaudible] where you have many different tiers of storage that are possible. And that’s a starting point. That way, you have the option of building a very powerful lake on top of that data, if and when you choose to. So just a few thoughts on that.
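The "store it now, convert and query it later" path described in this answer can be sketched roughly as follows, assuming Presto's Hive connector is pointed at S3. The `hive` catalog name, the `weblogs` schema, the table and column names, and the S3 path are all placeholders invented for illustration; `run_all` is a hypothetical helper that works with any DB-API style cursor.

```python
# Hypothetical names throughout: the `hive` catalog, `weblogs` schema,
# table names, columns, and the S3 path are placeholders for illustration.

# Step 1: register the raw JSON files already sitting in S3 as a table.
REGISTER_RAW_JSON = """
CREATE TABLE hive.weblogs.events_raw (
    event_time varchar,
    user_id    varchar,
    action     varchar
)
WITH (
    format = 'JSON',
    external_location = 's3://example-bucket/events/raw/'
)
"""

# Step 2: once the data becomes interesting, rewrite it as Parquet.
CONVERT_TO_PARQUET = """
CREATE TABLE hive.weblogs.events_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM hive.weblogs.events_raw
"""

# Step 3: run analytics on the converted table.
QUERY = """
SELECT action, count(*) AS events
FROM hive.weblogs.events_parquet
GROUP BY action
ORDER BY events DESC
"""


def run_all(cursor, statements):
    """Execute statements in order on a DB-API style cursor.

    Results are fetched after every statement so that each one runs to
    completion before the next starts; the last result set is returned.
    """
    rows = []
    for sql in statements:
        cursor.execute(sql)
        rows = cursor.fetchall()
    return rows
```

In practice the cursor would come from a Presto client library connected to your own cluster; nothing here is specific to Ahana, and the statements would need adjusting to your real schema and bucket layout.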

John Santaferraro | Industry Analyst 

Yeah, I think the other thing I was impressed with, and I think this is relevant to any size company, is the breadth of use cases that they’re able to run on Presto. They’re doing their ETL, data science, exploration, OLAP, and federated queries all on this single platform. They really are contributing back to the Presto open-source code: pushing real-time capabilities with its connection with Pinot; sampling, being able to run queries on a sample of data automatically; and more optimizations to increase performance. And you probably are intimately involved in the open-source projects that are listed here as well.

So, I think that it bodes well for the future of Presto and for the future of Ahana.

Dipti Borkar | Ahana 

Yeah, it’s incredible to be partnering in a community-driven project. There are many projects. Presto is a part of the Linux Foundation, so it’s a community-driven project. Facebook, Uber, Twitter, and Alibaba founded it, and Ahana is a very early member of the project.

We contribute back and we work together. Project Aria, for example, that you see here, came out of Facebook for optimizing ORC. We are working on Aria for Parquet. Parquet is a popular format that Uber can use, and Facebook can use, and other users can use as well. There are other projects as well, for example, the multiple coordinator project. Presto initially had just one coordinator. Now there’s an alpha available where you have multiple coordinators, which extends the scale of Presto even further. It reduces the scheduling limitations. We were already talking about thousands of nodes, but in case you needed more, it can go even beyond.

These are important innovations. The performance dimension and the scale dimension tend to be Facebook and Uber, and we are also working on some performance. But the enterprise aspects like security, governance, high availability, and cloud readiness are aspects that Ahana is focused on and bringing to the community as well. We have a second-half roadmap for Presto, and we’re excited to see how that comes along.

John Santaferraro | Industry Analyst 

Awesome. So, we started this session by talking about the complexity of Hadoop and open source when it was first launched. And quite frankly, nobody wants to manage 1000 nodes of Presto, unless you’re Ahana, maybe. So, let’s talk about Ahana. What have you guys done to simplify the use of Presto and make it immediately available for anybody who wants to do it? What’s going on with Ahana?

Dipti Borkar | Ahana 

Yeah, absolutely. And maybe, Ali, if I can share a couple of slides, just to bring up what that looks like in a minute. Okay. John, do you see the screen all right? [John: Yes, I do.] Okay, great. So Ahana is essentially a SaaS platform for Presto. We’ve built it to be fully integrated, it’s cloud native, and it’s a fully managed service that gives you the best of both worlds. It gives you visibility into your clusters, number of nodes, and things like that, but it is also built to be very, very easy, so that you don’t have to worry about installing, configuring, and tuning a variety of things.

How it works is pretty straightforward. You go in and you sign up for Ahana, you create an account. The next thing is that we create a compute plane in the user’s account, in your account. We set up the environment for you. This is a one-time thing, and it takes about 20 to 30 minutes. This is bringing up your Kubernetes cluster, setting up your VPC, and your entire environment, all the way from networking on the top to the operating system below. From that point, you’re ready to create any number of Presto clusters. And so, it’s a single pane of glass that allows you to create different clusters for different purposes.

You might have an interactive workload for one cluster, and you might have a transformation workload for another cluster. You can scale them independently and manage them independently. So, it’s really, really straightforward and easy to get started. All of this is also available through the AWS Marketplace. We’re an AWS-first company, and the product is available pay-as-you-go. So, we only charge for the Presto usage that you might have, on an hourly basis. And so that’s really kind of how it works.

At a high level, just to summarize some of the important aspects of the platform, one of the key decisions we made is: do you bring data to compute, or do you take compute and move it to data? I thought about it from a user perspective. This was an important design decision we made. Data is very valuable, as I said earlier, incredibly valuable, and increasingly users don’t want to move it out of their environment. Snowflake and other data warehouses are doing incredibly well, but if users had a choice, they would keep the data in their own environment. What we’ve done is take Presto and anything that touches data: Presto clusters, the Hive Metastore, even Superset (we have an instance of Superset that provides an admin console for Ahana). All of these things run in the user’s environment and the user’s VPC. None of this information ever crosses over to the Ahana SaaS. And that’s very important.

From a governance perspective, there are a lot of GDPR requirements increasingly, and so on. That’s the way it’s designed at a high level. Of course, as you mentioned, John, we connect to the data lake; that’s our primary path. 80% of the workloads we see are for S3, but 5% to 10% might be for some of the other data sources. You can federate across RDS MySQL, a Redshift data warehouse, Elastic, and others, for example. And we have first-class integrations with Glue. Again, it’s very, very easy to integrate: you can bring your own catalog, or you can have one created with a click of a button in Ahana.

You can bring your own tools on the top; it’s standard JDBC/ODBC. As you said, SQL is the lingua franca, and it’s ANSI SQL. Presto is ANSI SQL. And so that makes it very easy to get started with any tool on top, and to integrate it into your environment.
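To make the federation point concrete: Presto addresses every table as catalog.schema.table, so one ANSI SQL statement can join data on the lake with data in an operational database. The sketch below only builds such a query string; the `hive` and `mysql` catalog names, the `sales` and `crm` schemas, the table and column names, and both helper functions are hypothetical.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_revenue_query(lake_orders: str, db_customers: str) -> str:
    """Return one SQL statement that joins a lake table to a database table."""
    return (
        "SELECT c.region, sum(o.amount) AS revenue\n"
        f"FROM {lake_orders} AS o\n"
        f"JOIN {db_customers} AS c ON o.customer_id = c.customer_id\n"
        "GROUP BY c.region"
    )


# Hypothetical layout: order files on S3 exposed through a Hive or Glue
# catalog, and customer rows living in a MySQL database.
sql = federated_revenue_query(
    qualified("hive", "sales", "orders"),
    qualified("mysql", "crm", "customers"),
)
print(sql)
```

Because the result is plain SQL, the same statement could be submitted from any JDBC or ODBC tool connected to the cluster, which is the "bring your own tools" point made above.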

So that’s a little bit about Ahana. And I think that might bring us to the end of our discussion here.

Ali LeClerc | Ahana 

Great. Well, thank you, Dipti and John. What a fantastic discussion, and I hope everybody got a good overview of data lakes, data warehouses, what’s going on in the market, and how to make a decision on which way to go. So, we have a bunch of questions. I don’t think we’re going to have enough time to get to all of them, so I’m going to ask some of the more popular ones that have kept popping up.

So first is around Presto. Dipti, probably for you, is Presto a means of data virtualization?

Dipti Borkar | Ahana 

Yeah, so that’s a good question. Presto was built as a data lake engine. But given its pluggable architecture, it is also able to support what you might call virtualization. I would say virtualization is an overloaded term; it means many things. But if it means accessing different data sources, then yes, it’s capable of doing that, like you just saw in my last slide.

Ali LeClerc | Ahana 

Great. And by the way, folks, we do have another webinar. This is the first webinar in our series. Next week, we’ll be going into more detail on how you can actually do SQL on the data lake. Highly recommend if you’re interested in learning more, going a bit deeper, checking that out. I dropped the link to register in the chat box. So, feel free to do that.

So, a question, I think for both of you. Dipti, earlier you touched on this idea of augmenting the data warehouse versus perhaps skipping the data warehouse altogether. And so, Dipti and John, I think you both kind of bring a different perspective to that. What are you seeing in the market? Are people facing that decision? Or is it leaning one way or the other? What’s going on around augmenting versus skipping?

John Santaferraro | Industry Analyst 

One of the trends that I’m seeing is that when data originates in the cloud, it tends to stay in the cloud, and it tends to move to a modern architecture. So, in truly digital instances, rather than organizations taking digital data and trying to get it back into a legacy or traditional data warehouse, that’s almost always going into a data lake and into, you know, I love what you term it, Dipti, the Open Data Warehouse, using those formats.

That said, as people continue to migrate to the cloud: when I was at EMA, we saw that approximately 53% of data was already in the cloud. But that means 47% of the data is still on premises. And so, if the data is already there, and in a database, that migration may or may not make sense. You have to kind of weigh the value. And oftentimes the value is having it all in a single unified analytics warehouse.

Dipti Borkar | Ahana 

Right. Yeah, what I would say is that I think it depends on whether it’s cloud or on prem. Most of our discussion has been about the cloud, because we are forward-looking people, forward thinkers. But the truth is, there really is a lot of data on prem. On prem, what we’re seeing is that it will almost always be augment.

Because most folks will have a warehouse, whether it’s Vertica, Teradata, DB2, or Oracle, whichever it is, and they might have HDFS, a Hadoop kind of system, on the side. That would be augment; that’s more traditional. In the cloud, I think we’re seeing both. We’re seeing that users who have been on the warehouse are choosing to augment and not just migrate off completely. And I think that is the right thing to do; you do want to have a period of time. When I say a period, it’s years, where if you have a very mature warehouse, it will take some time to migrate that workload over to the lake. And so new workloads will be on the lake, and old workloads will slowly migrate off. So the way we see it, it’s really augment for a period of time.

You know, I often joke that mainframes are still around. So, warehouses aren’t going anywhere. And so that’s the augment case. Now, for the pre-warehouse users, who don’t have a warehouse yet, I’m seeing that about 20-30% are choosing to skip the warehouse, and I would say that percentage will continue to increase. That will only increase as more capabilities get built on the lake. Transactionality is very early right now. Governance is just starting to get to column level and row level, filtering, masking, and so on. So, there’s some work to be done.

We have our work cut out for us on the lake. I see it as a three-to-five-year period, where this will then start moving and more and more users will end up skipping the warehouse and moving to the lake. But today, depending on the use cases, for the simple use cases, we are seeing that about 20-30% are just going directly to the lake.

Ali LeClerc | Ahana

Wonderful. So, with that, I think we are over time now. We appreciate everybody who stuck around and stayed a few minutes past the hour. We hope that you enjoyed the topic. John, Dipti, what a fantastic conversation. Thanks for sharing your insights into this topic. So, with that, everybody, thank you. Thanks for staying with us. We hope to see you next week. Thank you.

Speakers

John Santaferraro
Industry Analyst

Dipti Borkar
Cofounder & CPO, Ahana