
On-Demand Presentation

Community Roundtable: Open Data Lakes with Presto, Apache Hudi & AWS S3

As more companies augment their traditional cloud data warehouses, and in some cases replace them with cloud data lakes, a new stack has emerged. It supports data warehouse workloads that weren't possible on a data lake before, while bringing advantages like lower cost, flexibility, and no lock-in thanks to open formats and open interfaces.

The new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack

Unlike the cloud data warehouse, which is closed source, stores data in proprietary formats, tends to be very expensive, and assumes data needs to be ingested and integrated into one database to provide the critical business insights for decision-making, the PHAS3 stack is open, flexible, and affordable.

In this roundtable discussion, experts from each layer in this stack – Presto, AWS, and Apache Hudi – discuss why we're seeing such pronounced adoption of this next generation of cloud data lake analytics, and how these technologies enable open, flexible, and highly performant analytics in the cloud.


Webinar Transcript:

SPEAKERS

Vinoth Chandar | Hudi, Roy Hasson | AWS, Dipti Borkar | Ahana, Eric Kavanaugh | Bloor Group

Eric Kavanaugh | Bloor Group

Ladies and gentlemen, hello and welcome to the Community Roundtable. Yours truly, Eric Kavanaugh, is here, frankly humbled to be with such experts. We’re going to talk about open data lakes with the new stack. And this is a great moniker, we’re actually using it ourselves for something slightly different. But they’re both along the same lines. The new stack refers to the new technology components that you can weave together to build out your enterprise computing platform. And what’s happening these days is absolutely amazing.

So folks, open data lakes with Presto, Apache Hudi, and AWS Glue, and of course S3: the next generation of analytics. We're going to talk to my good friend Dipti Borkar, from Presto, also Vinoth Chandar, from Apache Hudi. And Roy Hasson, and there he is, he was the voice of the cloud a minute ago, now he's visual. So AWS has materialized in our view.

We're going to talk about what this all really means, folks. Just very quickly, I'm very excited and very bullish about what we're seeing here. I've been tracking this industry for over 20 years. It was about 15 or 16 years ago that I interviewed a guy named Michael Stonebraker, Dr. Michael Stonebraker, who was talking about how one size does not fit all. Again, this is 2005, he was promoting something called Vertica. And he was basically saying that, look, the relational database has won for some reason in the enterprise, and he had his theories about that. But he said, that doesn't make sense. There's a need for purpose-built, specialized technologies for use cases that relational is not very good at. And so he was pushing Vertica back then, which of course is now part of HPE. It's kind of bounced around a bit, but it's a column-oriented database. Around the same time, I also started researching open source, and I remember distinctly kind of crystallizing thoughts in my head about service oriented architecture and open source technology. And I thought to myself, this is some interesting handwriting on the wall, it's going to impact the major players sooner or later, Oracle, IBM, SAP, etc. These guys are going to have to wake up to what's happening here, because SOA, if done properly, will enable the mixing and matching of component parts.

Well, fast forward 16 years, here we are today, talking about the new stack. What happened? Open source is a huge part of the equation here, folks; open source has recast the enterprise development world. What you're going to see today from our guests, like I said, Vinoth Chandar from Apache Hudi, Roy Hasson from AWS, Dipti Borkar from Presto, is what this new stack really means. And it's very exciting, because we've basically taken the old database, which was great, as a microcosm and built a macrocosm out of it. So now you have different component parts like Hudi, like AWS Glue, like S3, like Presto, for example, and what the folks at Presto and Ahana are bringing to the table as well.

This is a new stack, a new way of doing things. And of course, we saw with Snowflake's IPO that data warehousing is alive and well. But that's sort of a closed circuit way of going about it, right? You're trapped inside of Snowflake, you have to pay them for compute. They're very clever about separating compute and storage, and they're very clever about spinning up warehouses and taking them down. But nonetheless, it's still a closed environment. We're going to talk about the open environment today.

So let's go around the room and introduce our guests. I'll ask them to introduce themselves. Dipti Borkar, I'll throw it over to you first: tell us a bit about yourself and what Presto is.

Dipti Borkar | Ahana 

Yeah, hello, everyone, and great to be here with Vinoth, Roy, and Eric. I've worked and interacted with all of you on various different projects. Looking forward to this discussion. I'm the co-founder and the Chief Product Officer at Ahana, and also chair of the Presto Foundation Community team. I've been in open source for over 10 years.

You know, you talked about the range of databases. I spent a lot of time on the relational database warehouse side with distributed DB2 and the core storage and indexing kernel, then transitioned to NoSQL at Couchbase. Many years there, building a range of different technologies, SQL on JSON. And fast forward a few years, we are seeing a new paradigm emerge with data lakes, and building SQL for S3. So Presto, essentially, is a distributed query engine. It's an open source engine created at Facebook and open sourced by them. It's part of the Linux Foundation, under the Presto Foundation, and it's built to be a great engine for data lakes, as well as other databases.

As you mentioned, there is polyglot persistence, as in many different options that people have, and you can also federate across them with Presto. Primarily, it's being used on top of the data lake. You know, you mentioned Snowflake; this stack is really an open lake. You have an approach where we're augmenting some of these data warehouses. I'm seeing a lot of different users, community users, customers move to this stack, where you have a query engine, you have a transaction manager layer in there, you have a data catalog, like AWS Glue, and then obviously the cloud object storage, which is S3. So that's a little bit about me and Presto.
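
To make the stack concrete from the query side, here is a minimal sketch, assuming a Presto cluster whose hive catalog is backed by the AWS Glue Data Catalog; the coordinator host, schema, and table names are hypothetical placeholders.

```python
# Querying Parquet/ORC data on S3 through Presto, with table metadata
# resolved from the Glue Data Catalog (pip install presto-python-client).
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator endpoint
    port=8080,
    user="analyst",
    catalog="hive",    # Hive connector, here assumed to be backed by Glue
    schema="sales",    # a Glue database
)
cur = conn.cursor()
# The schema lives in the catalog; the bytes themselves live on S3.
cur.execute("SELECT region, sum(amount) FROM orders GROUP BY region")
for row in cur.fetchall():
    print(row)
```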

Vinoth over to you.

Vinoth Chandar | Hudi

Yeah. Hey, my name is Vinoth and I'm the PMC chair for the Apache Hudi project at the ASF. And yeah, my background: I've done databases for over a decade now. Basically a one trick pony. I started on databases at Oracle, working on CDC, you know, Oracle GoldenGate, XStreams, all of these products. Then I led the Voldemort key value store at LinkedIn, through kind of the hyper growth phases of LinkedIn. Then, you know, a brief stint at Box, where I was building kind of like a Firebase replacement, then landed at Uber, where we created Apache Hudi. Since then, we've also been growing the project outside.

Most recently, I was at Confluent, where I was working on ksqlDB. And to me, you know, Hudi, in short, started as a transactional layer on top of Hadoop file system compatible storage, like, you know, HDFS or S3, or object stores in general. It brought mutability to the data that you store on top of these object stores, indexing, and, you know, all of the functionality that you need to build kind of an optimized data plane on top of object storage. That's what Hudi provides.
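
As a rough illustration of that mutability, here is a minimal PySpark sketch of a Hudi upsert into a table stored on S3. It assumes a Spark session with the Hudi bundle on the classpath; the bucket, table, and field names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

updates = spark.createDataFrame(
    [(42, "shipped", "2021-06-01 10:00:00")],
    ["order_id", "status", "ts"],
)

(updates.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")  # record-level key enables mutation
    .option("hoodie.datasource.write.precombine.field", "ts")       # latest record wins on collision
    .option("hoodie.datasource.write.operation", "upsert")          # insert-or-update, not append-only
    .mode("append")
    .save("s3://my-lake/orders"))  # hypothetical table path on S3
```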

Over the years, we also built a good set of platform components on top of this layer that complete the picture in terms of how you bring the data in, external data ingestion, and kind of self management. Just like how databases have a lot of daemons optimizing things in the background for you, Hudi already comes with a runtime where all of this happens out of the box for you.

Eric Kavanaugh | Bloor Group 

And last but not least, Roy Hasson from AWS. Tell us a bit about yourself.

Roy Hasson | AWS 

Sure. Hi, everyone. Again, my name is Roy Hasson, product manager on the AWS Glue and AWS Lake Formation team. I've been with AWS for about five years, actually almost five and a half years, working with a lot of different customers on building these types of data lakes. I've also been heavily involved in the early launch of the Amazon Athena and AWS Glue services. So I've been in the weeds with a lot of customers, really trying to take this vision and implement it in a way that is scalable and meets their needs. Definitely learned a lot throughout these years, and that's the feedback that we've been pushing into our services to try to make them easier to use and better integrated. And we can talk about kind of what we're doing.

I think generally speaking, when we talk about Glue, we're really referring to the Glue Data Catalog here, which provides a central metadata repository for everything that you need inside your data lake. So instead of having your data inventoried in multiple catalogs, the Glue catalog gives you one place to inventory your data, but also a way to access it. When we layer things like Lake Formation on top of that, we can add security and governance to our data lakes.
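
As a small sketch of that "one place to inventory your data" idea, the snippet below walks a Glue database with boto3 and prints where each table's files actually live; the database name is a hypothetical placeholder.

```python
import boto3

glue = boto3.client("glue")

# Each Glue table records its storage location and format, so any engine
# that reads the catalog can find and interpret the data on S3.
for page in glue.get_paginator("get_tables").paginate(DatabaseName="sales"):
    for table in page["TableList"]:
        sd = table["StorageDescriptor"]
        print(table["Name"], sd.get("Location"), sd.get("InputFormat"))
```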

It's really about cataloging the data that exists in your lake, but also making it accessible through different tools, Ahana, Athena, Redshift, etc., in a central way, so that it's easier for users to find data and access it. So that's kind of the gist of it. Looking forward to this conversation.

Eric Kavanaugh | Bloor Group 

Yeah, let's go around the room and have you each describe the new stack from your perspective. I kind of hinted at this in the opening. Dipti, I'll throw it over to you. What we're seeing now is quite fascinating: it's the open source community, and the committers who are involved, really creating these components that can then be brought together into a stack. And of course, in the old monolithic way of doing enterprise software, you would have that monolithic system, which would take care of things.

Like the database, for example, would do all sorts of different things: indexing, caching, you know, pre-preparing data, analytics. All this fun stuff could be done inside of a database. But then we realized that's not very scalable. When you start dealing with the scale of the internet today and business markets that are just growing by leaps and bounds, you can't do things the old fashioned way.

And so now what we're doing is developing each of these components as a scale-out unit itself. So you can scale out wherever you need. Is it the storage? Is it the analytical capability? What is it that you need done? Each of these components is getting that done in its own scalable way. And I think that's the real key, Dipti: scale?

Dipti Borkar | Ahana

Yeah. It's interesting. When we were discussing and preparing for the session, I went back to the first blog I wrote at Ahana, which had this in there: you know, the holy grail of databases, right? When you take a database class, this is what you learn. You have the full stack: you have your clients, you have your query engine, which is your parser, compiler, optimizer, query rewrite, and execution engine, and then you have the transaction and storage manager. You have a buffer pool, you have a lock manager, you have logging, and then there's the catalog and other utilities.

What's happened over the last few years, and in some ways Hadoop kind of started it off, but it got extremely complicated with 70 different projects, is that we seem to have learned from some of those lessons, and a cleaner stack is now emerging. Presto, as an example, was built to be a database query engine, as opposed to MapReduce or Spark, which were more general purpose computational engines. You have this box, which is now Presto; you have some of these parts, the log manager, some aspects of locking, some aspects of the buffer pool, which are Hudi, and there are others in there, like Delta and so on. And then you have the catalog, which manages the schema for the database and the schema for the tables: tables, columns, and everything we know about databases. So in some ways, the stack has been split apart, and it's now coming together as one stack with Presto, Hudi, Glue, and S3. We're starting to see the popularity of this stack; we're kind of calling it the PHAS3 stack: Presto, Hudi, AWS Glue, and S3.

What we're seeing is that there are a few reasons for this. Data lakes by themselves, which is S3, were immutable. They're immutable, right? It's an object store. So you can't really do anything with just that data; you need the intelligence on top. You had query engines like Presto that came out that could query this data with a catalog, like a Hive metastore or Glue. But even then the data was immutable. So you really couldn't do inserts, updates, and deletes at a production level; there were restrictions around this. So you couldn't run the traditional data warehousing workloads on the data lake. That's where, with the new stack and the emergence of some of these new layers, you are now seeing the ability for the first time to run very traditional data warehouse workloads on a lake. And that's where the real value to end users comes in.
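
For instance, a record-level delete, which plain S3 objects cannot express, becomes just another write operation once a layer like Hudi sits in front of the files. A minimal sketch, reusing the hypothetical names from the earlier upsert example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi bundle is on the classpath

# Rows carrying the record keys to remove from the table.
deletes = spark.createDataFrame([(42, "2021-06-02 00:00:00")], ["order_id", "ts"])

(deletes.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "delete")  # drop matching record keys
    .mode("append")
    .save("s3://my-lake/orders"))  # hypothetical table path
```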

You get the flexibility of open formats like ORC and Parquet; these are open formats, so you're not locked into a certain proprietary format. You have the ability to run other workloads on the same data without moving it around. And you get the scale. Because data is of different types, you have structured, semi-structured JSON, etc., and the newer query engines, like Presto, allow you to query a range of different types of data. So you can have objects that are JSON or CSV, or ORC and Parquet, obviously the more performant formats, and others. So you can query all of this in place and run other engines on top of that same lake. That's the vision of this open data lake, where you get the best of the data warehouse, but you get the flexibility and the lower cost and the scale for the next 10-20 years.

That's how I explain, at a high level, the transition over and where these pieces fit in. Back to you, Eric.

Eric Kavanaugh | Bloor Group

Yeah, and Vinoth, maybe you could share your thoughts too, because, again, what we're seeing is a focus on each component part. You, of course, are focused on Hudi these days, that sort of transactional layer. Can you talk about what gets done in there and how we're able to create greater elasticity?

Because we've decoupled these components; at least the hard coupling is gone, they're loosely coupled now. That was also something from SOA, loose coupling. The same principles are now being adopted at a different scale.

But tell us about your perspective, Vinoth.

Vinoth Chandar | Hudi

Yeah, that's actually a very, very insightful question. This goes back to even the days when we were thinking about designing something like Hudi at Uber. So on day one, we had to support three engines: Hive, Presto, and Spark. When you look at how things were then, each engine was good at its own use case, and we had to design it that way. Even with respect to a catalog, we carefully designed Hudi as a transactional storage layer that can interact with something like Glue, or something like a Hive metastore, or some of these catalogs, in a very decoupled way. What this allowed us to do was horizontally scale, let's say, the writing or the indexing capabilities, elastically. We can have 1,000 cores ingesting data in, while you can choose however many you want, and you can pick an engine of choice for your queries. This allowed for greater flexibility for us.

And it also, in my opinion, unlocked very different use cases that are probably not possible in the traditional [inaudible] as well. For example, if you want to lower the data latency in data lakes, with the horizontal scalability that you get from this kind of decoupled model, you're easily able to choose to throw, let's say, more executors or cores at the writers to achieve the latency that you want. It's very tunable. You don't have to, like, downsize and resize a warehouse; you can just focus on sizing your warehouses based on a steady query workload. This model, in general, gave up long running servers. And we can get into what we really lost from that. But we gained a lot more scalability and elasticity, in my opinion.

Eric Kavanaugh | Bloor Group

Do you mind diving into that real quick? When you say "we gave up long running servers, we gained a lot more," what do you mean by that exactly?

Vinoth Chandar | Hudi

Yeah. So if you look at this architecture, there are no long running servers in the data plane. What I mean is, if you look at a data warehousing architecture, let's say we take Vertica, or any of the cloud warehouses, there's a bunch of servers which you do RPC on, right? A query hits them, there's a node which plans the query and distributes it within that warehouse cluster. They're able to, for example, cache a lot of metadata in memory, so it can be much faster to access metadata.

In this architecture, these caches are sitting with either Hudi or within Presto; each layer is caching some of the parts of this. But long running servers, which can do transaction management, can probably give you more features, like multi-table transactions, or, let's say, the record-level locks that you find in traditional lock managers. All those are implemented using some kind of in-memory locks. So we've given up some of these things.

But again, what the last four years of building this community out and supporting our use cases has shown us is that for analytical workloads, we probably don't need them as much. That's what we learned from that.

Eric Kavanaugh | Bloor Group

Yeah, that's great. That's a really good insight, because it kind of explains what we've realized along the way. But to your point, and this is kind of what Stonebraker was saying 16 years ago, he said, look, the whole industry defaulted to a certain model, and he joked about it and asked why it happened. He took a dig at sales and marketing people. He's like, "Well, the marketing people couldn't get the message straight. It was just easier to say it this way." I thought that was kind of funny. But his point was that you have different use cases. And like you just said, sometimes you don't need these older services that we've grown reliant upon, especially if they're not required in a particular use case.

So you’re enabling the sort of heterogeneity of use of the platform, which can be great in terms of performance for all kinds of people.

But let's turn it over to Roy Hasson from AWS. Can you walk us through Glue, how it has evolved, and how it's fleshing out the stack?

Roy Hasson | AWS

Yeah, so I mean, Glue itself, and in particular the Glue Data Catalog, kind of started off as a Hive-compatible metastore that really tries to simplify the way that customers manage metadata in their environment. We had a lot of customers on premises, and even migrating to AWS, managing these Hive metastores on top of Amazon RDS or self-managed databases, and it's a critical component of the entire solution. If you don't have metadata, nothing is going to work, or it's not going to work well. But also, managing those databases is just a pain. There's no need to do it: you have to build replication, etc. It's not something you really want to do. So when we created the Glue Data Catalog, we basically came in and said, okay, it's a critical component, we need to make sure that it works well, but we need to make it serverless so the customer doesn't have to worry about it. And integration was a really key aspect. We didn't want to just create our own set of APIs and say, hey, everybody go ahead and integrate with it.

So we chose to be Hive compatible, in the sense that our APIs are very, very similar. So if you're using Spark, or Presto, or whatever that may be, you can plug into the Glue catalog without a lot of development or a lot of complexity. I think that was the tipping point for the Glue Data Catalog, where we could say: now we can start plugging into more and more systems. So tools like Databricks, and Snowflake, and, you know, of course Ahana and lots of others are integrated with the Glue Data Catalog, which makes data access simpler. And I think the overall picture here, and Dipti and Vinoth kind of talked about this, is that these technologies are breaking the monolith and making it more decoupled, so we can have scale and performance and cost benefits across the board.
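
In practice, that Hive compatibility looks something like the sketch below: pointing a Spark session's metastore client at the Glue Data Catalog so existing Hive-metastore-based code keeps working. The factory class shown is the one Amazon EMR ships for its Glue integration; the wiring differs on other platforms, and the database name is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("glue-as-metastore-sketch")
         # Swap the Hive metastore client for the Glue-backed implementation.
         .config("spark.hadoop.hive.metastore.client.factory.class",
                 "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
         .enableHiveSupport()
         .getOrCreate())

# Tables registered in Glue now resolve like ordinary Hive tables.
spark.sql("SHOW TABLES IN sales").show()
```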

But the one thing that we have to remember, and I really believe that this stack helps there, is how do we make it simpler, easier to use? Yes, there are a lot of moving parts. Yes, there are benefits to all of these things, so we don't want to give that away. But we also don't want to give away the ease of use that we need. And Hudi comes in and says: we've got the data. If stuff comes into the lake, we'll manage it for you. We'll update, we'll insert, we'll delete, we'll do compaction, we'll do all that stuff that in the past you'd have to build ETL processes for. I talk to customers all the time: I've got this massive ETL job that copies data to a staging directory and then copies it to the production directory; it's a pain, customers have to stop and wait. Hudi just kind of does away with all these things. And that's the benefit of this stack. You get decoupling, you get scalability and performance, but you can say to the IT team: it's covered, you don't have to do anything. Hudi's doing all this heavy lifting on the data side, the catalog is just managing all the metadata for you, and then Ahana's Presto just plugs on top and queries the data. And if you want more, say, hey, I want to do some query federation, I want to extend, I want to grow beyond just data in S3, Presto and Ahana make that much easier to do.

So I think ease of use is kind of like the bow on top of this whole package that makes it much easier for companies to consume.

Eric Kavanaugh | Bloor Group

And it really is a new way of – –

Dipti Borkar | Ahana

A couple of things to add to that, Eric. You know, one of the things that Vinoth said is that we have a little bit more flexibility on the analytical side. And that's actually important to understand. Because, you know, if I go back to this previous chart, the original databases were built for OLTP workloads. So they were very rigid; these were business transactions, ACID compliance was extremely important. You needed multiple levels of isolation, again, very important.

But on the analytics side, you can give a little bit, and that's where, because of this flexibility, we are now able to disaggregate the system a little bit more for these workloads. And over time that will change, right, that's evolving as well. In the current state, we may not have all the ACID compatibility that you get with an OLTP system, but that's okay. Before, we didn't even have the ability to insert, update, and delete. So now it's at a point where not only is it simple to use, to Roy's point, and I'll talk about how managed services in the cloud have transformed that as well, but there's also this flexibility: because we've given a little bit on some of these hard constraints, we are able to run these workloads on this new stack, which wasn't possible before. And so that is an important aspect.

Now, the ease of use is, you know, one of the reasons why I created Ahana. It was very hard to do SQL on S3. With Hadoop, there were many, many different components, many different projects. But when it's all together, integrated into a managed service, and it just seamlessly fits in with other services like Glue, Lake Formation, and others, it makes the life of a data platform engineer much easier, because you're not managing the operations of the system on your own. And that's really important. Because we saw with Hadoop that it took six months, nine months for even projects to complete. And even at that point, you didn't really use the system fully; a lot of time got spent on the operations of managing the system, and the Hive metastore, and all of these different aspects. But in this new world, managed services have simplified the lives of platform engineers.

So you no longer need to do all of that; everything comes kind of built into the system. Hudi is doing its part, doing the compactions as needed; you can obviously schedule those, and there are APIs for things. Presto is doing its part, where you have auto scaling and the ability for cost management: if the cluster is not doing anything, it can go into an idle state. Glue is managing its thing. And that is very important, because at the end of the day, when you have a stack, you want to see value from it right away. You want to see value where you get insights from the data; that's the outcome, and the outcome matters. And because these different managed services plug in well together, a three person data platform team, and we have a lot of customers that are running this stack with a two or three person data platform team, is able to run all of this, which was impossible two, three years ago.

So the cloud has transformed things, and managed services have helped with the adoption of this stack, going from something highly complicated, with many different components where you had to figure out the integrations yourself, to something where all of it fits well together now.

Eric Kavanaugh | Bloor Group

Yeah. Yeah, it's very interesting, too, because it used to be that this one component was the constraint, the choke point, and I'll basically throw this over to Vinoth Chandar of Hudi: now we have much more versatility. But you've also, to a certain extent, future-proofed the architecture. Because I think one of the constraints of going with the old way is that doing something new is very difficult and challenging. Ripping and replacing is always a very, very painful thing; nobody wants to do that. So what we've kind of seen here is a whole separate architecture grow up around the existing systems. And now we're mostly using this stuff for net new, but a lot of times you are using these new stacks for offload, so that traditional data warehousing workloads really can work in this new environment, which can ease some pressure on your data warehousing team.

I think the challenge is that from an organizational perspective, it's not just technology; it's people, it's budgets, it's human resources, it's all these different things that amount to de facto constraints. And if you can expand the capability of the stack while bringing down the number of people necessary, well, that really enables each individual to do a whole lot more.

What do you think about that Vinoth?

Vinoth Chandar | Hudi

Yeah, definitely. For example, in this model, going back to how we decoupled access to storage: when you want interactive query performance, you deploy something like Presto, which has long running servers and can do a whole bunch of caching internally, of data and metadata, to speed up queries. But if you want to do, let's say, data science or machine learning workloads, a good chunk of those workloads are data preparation and kind of more complex feature extraction, like ETL jobs, right? You are now able to access raw S3 with very little overhead, with Hudi, for example, being that lightweight transaction layer which gives you the latest snapshot of a table. And you can actually scan the data at S3 speeds, without getting bottlenecked by a lot of server tier in front of it, like you would if you were to, say, access data in Snowflake from another Spark cluster. You have to pay for both, and then you'll still be limited to the size of, let's say, the warehouse cluster.

This really does make it a more general purpose architecture that can support both analytics as well as the emerging data science and machine learning workloads. That said, I think we've solved it fairly well for structured data, and maybe semi-structured data. There's a lot of data beyond this, right? Like, if we look at computer vision and AI, and all of the growth that has happened there; I think very recently TensorFlow learned how to read Parquet files [inaudible]. So we're still far away from doing this for a whole bunch of data that is not even tackled by warehouses at this point. And then this sets us up for a very nice future where you have open data. Then you have a lot of choice around what engines you want to pick, at what price point, and what capabilities the stack offers. It's not just about performance in a lot of cases as well.

So yeah, that's why I believe in this stack as kind of the future for how we do data in the industry.

Eric Kavanaugh | Bloor Group

Yeah, that's very cool. And Roy, I'm going to bring you into this. I'll go down memory lane here again, which I always love to do, just kind of reexamining my own learning curve and how we got here. I remember, 20 years ago, working with a number of enterprise software companies; a good friend of mine was running one. And he was talking about metadata and metadata cataloging and so forth. And I said to him, I've noticed that all these different companies have their own metadata repository. And it's good that they have that. But wouldn't it be better if you had a sort of unified metadata repository that different companies could access? That would expedite reuse of data and mixing and matching all this. And he kind of laughed and said, well, yeah, I guess so, but that's not going to happen. And I think that's sort of now finally happening, in part because of the cloud. Because we have the scale.

We have the resources, of course. Amazon Web Services, hats off to those folks for getting a 10 year head start on the competition. I don't know how that happened, but it happened [inaudible].

I’ll throw it over to Roy to kind of comment on that. Are we finally getting to a point where that metadata management component is so robust, and so versatile, that we can stop reinventing wheels to a certain extent?

Roy Hasson | AWS

Yeah, I mean, I think generally speaking, there are sort of two paths being taken by customers. The first one is really focused on data discovery and search. And the second one is around data access and security and governance. Those two right now are still fairly separate tracks. But on the data discovery and data search side, you're seeing lots of open source tools, like Amundsen and Nemo and, you know, you name them, DataHub, that are really focused on simplifying the discovery and cataloging of data, and then making it easy for users to come in and search and discover and understand and annotate and collaborate on these kinds of things. But they're not really solving the problem of data access and security.

We've had these tools around, like Collibra and Alation; they've been around for a long time. And they've done a really good job at providing this type of catalog. But when a user goes to query the data, let's say they go to Ahana and do SELECT * FROM something, how do I satisfy that? I still need to have a Hive metastore, or some metastore that can serve the access needs. And that's where the Glue catalog really comes in and says: hey, we're going to solve the discovery and the cataloging aspect of data, but we're also going to give you the option to access the data from your choice of tool. Now, of course, we have lots of room to grow. It's not perfect; you know, some of the tools out there have really awesome features and are doing a really good job.

But again, right now, what I see in the market is these two paths kind of running in parallel; eventually, they're going to start merging together. And customers are definitely seeing the value, because there is more data, right? There's a lot more data, and there are a lot more ways of accessing the data. So the bottleneck becomes finding it, and understanding it: should I use this? Is it fresh and stable? Does the data have good accuracy? Is it something that Roy randomly created, and he has no idea how to do math? You know, that's something that you've got to make sure you get right. And that's why these catalogs are becoming more and more of a central focus for our customers.

But the other aspect that I'll add into this is security and governance. Once you've done that, you know, you may go into the catalog and say, well, Roy can only see these specific tables. How do you actually enforce it? You still have to enforce it when Roy runs a query in Ahana, or in Athena, or in Redshift. How can I consistently enforce those permissions without having to duplicate those policies in each of these systems? And that's where the Glue Catalog with Lake Formation really comes in and centralizes all of that together.
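
As a sketch of what that centralization can look like with boto3, the call below grants one principal column-level SELECT on a cataloged table, and every Lake Formation-integrated engine then enforces the same rule. The account, database, table, and column names are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/roy"  # hypothetical user
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "orders",
            "ColumnNames": ["order_id", "status"],  # column-level grant
        }
    },
    Permissions=["SELECT"],
)
```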

Dipti Borkar | Ahana

We haven't talked about security at all. I mean, there is a big box that fits around what we were looking at, the stack, on security. Because as soon as you have a lake, you're seeing all your data in the lake now, from across the enterprise: it's streaming data, it's enterprise data, it's IoT data, it's third party data, it's all your data. You absolutely need governance on top.

There are different approaches to that. And that goes to the operational catalog. Like Roy was saying, the operational catalog, either the Hive metastore or Glue, really is a mapping between databases, and tables, and objects. Because with SQL you can't query objects directly; you have to have some sort of mapping. And that is kind of a foundational element. On top of that, there's Apache Ranger, right, which handles authorization; that's coming up more and more. There are obviously authentication mechanisms, like LDAP and, you know, other SSO, that need to be built in. And then managed services like Lake Formation are simplifying that by adding governance right on top of the storage layer, and taking those concerns down from the query engine at the top of the stack to the [inaudible].

We're seeing some innovations and movement in the stack, actually, that make it a lot more flexible for longer term integrations with other engines, with multiple different types of data processing on top. That's an area where we've just kind of scratched the surface so far. In the next two, three years, there will be more innovation in the security and governance space with data lakes.

Eric Kavanaugh | Bloor Group

Now, that's a really good point too. Let's dive into the synthesis, because I was thinking to myself, Dipti, there is this great German concept of Gestalt, which basically means the whole is greater than the sum of its parts. And again, here you have groups working on specific components, but the magic happens when you bring them all together. Because what you're doing is resolving old problems with the new technology stack. But you do need to have that thorough vision from the top to the bottom.

Right, that's kind of what you're alluding to, I think. Can you talk about the importance of appreciating the fullness of the stack and staying on top of the different components as they evolve?

Dipti Borkar | Ahana

I can go first. Absolutely. I mean, you know, it's like one plus one is three in this case, or one plus one plus one plus one is ten, with the four components we've been talking about. At the end of the day, it's the outcome. If you put yourself in the shoes of a data platform team, or a data platform engineer, what we see is they're looking for lower cost, good enough performance, security, and the ability to run all the workloads that they have been running, plus the ability to run even more advanced workloads in the future. That's what this stack essentially enables.

The query engine is getting more advanced as well, where with Presto we are doing more pushdowns and adding more of the traditional capabilities. Databases have been around for 30, 40 plus years, [inaudible] was written in 1970, and there's a lot of innovation from there that needs to be folded in. Just like the transaction piece is coming in with Hudi, the query engines themselves will keep getting more advanced. There's more that needs to be done there. As each of these layers becomes stronger and stronger, and can handle more of these workloads, and we talked about transactionality and ACID compliance, where we've kind of scratched the surface, you will start to see a much bigger move to data lakes.

Today, it might be an augment strategy, where maybe you use a Snowflake or Redshift for 20% of your workloads, for more reporting and dashboarding use cases; you use Presto for the interactive, ad hoc query analysis, some data science, some, you know, SQL notebook kind of workloads; and you use Spark for deep transformation, ETL, and others in the lake.

Over time, we'll see a much bigger move as the stack evolves and becomes much stronger. That's how I see this evolving. I think OLTP will stay the same; there will be some high performing databases that will always be there. But from a warehouse perspective, the lake will consume a lot more workloads.

Eric Kavanaugh | Bloor Group 

Okay, good. And maybe, Vinoth, you could explain in some more detail: what are the possible use cases for this transaction layer? What are some of the things that you're now able to do in Hudi that you used to be able to do in traditional database systems, or still do, quite frankly, in traditional database systems? What are some examples of what this transaction layer can do for a company?

Vinoth Chandar | Hudi

Of course, right. Like we mentioned, previously we were not able to do even single table transactions when writing to the lake, like how we used to do with databases. That's the basic thing that we started solving. But specifically in Hudi, we focused on adding indexing capabilities, and we have a file layout which is very conducive to fast inserts and deletes. When we designed Hudi, we wanted to make it almost OLTP-ish performance for update and delete workloads. It's still batched; it's not the same low-latency performance as your regular OLTP database. But compared to where even [inaudible] and the data lake were, I think that's one part that organizations can benefit from.

Back in 2016, when we started the project, we solved mutability and transactionality as a means to an end for solving incremental processing. Our biggest problem at Uber back then was that we had all these big batch jobs, and we needed to incrementalize them. So right now, if I look back, I think we've solved incremental data ingestion pretty well. We can deploy something and it's self-managing; it works out of the box. The second part is that, just like databases bring CDC, we bring a lot of CDC capabilities to data lakes, and that is something that Hudi uniquely brings to the table right now in the industry: you have record level change streams, just like you can consume from RDBMSs. What this opens up is a reimagination of data processing, using an optimized storage layer like Hudi. Now you can do stream-table joins in the data lake.
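
A minimal sketch of consuming that change stream with Spark: an incremental query that pulls only records committed after a given instant, instead of rescanning whole partitions. The table path and instant time are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi bundle is on the classpath

changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20210601000000")  # commits after this instant
    .load("s3://my-lake/orders"))

changes.show()  # this delta can feed the next incremental job downstream
```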

Then these frameworks like Flink and Spark and Beam are evolving to generalize batch processing in terms of stream processing APIs, if you will. So it will be a very interesting next few years as we go towards that. I believe a lot of batch processing today is still reprocessing a lot of data. Typically the way we do batch processing is, you know, take the last N partitions of the data and run over those partitions again. So we literally added broad batch processing operations to Hudi late last year; we believed in this kind of incremental vision. But yeah, I think these things are coming together now, where organizations will be able to drop a lot of their data processing compute spend by adopting a more incremental model.

And this is all made possible not just by the transactional capability, but by the fact that we designed for fast updates and deletes, so we can absorb delta changes quickly, and also hand out a best in class CDC change log for other downstream data processing.

Eric Kavanaugh | Bloor Group

That's fascinating. It's really cool stuff. I'll turn this one over to Roy; we have a couple of good questions from the audience, and a couple of them are asking about Delta Lake too. So I'll just read one to you, Roy, and then maybe Dipti, after Roy comments, you can comment on this too.

But speaking of open source projects: "I can't help but compare Apache Hudi versus Delta Lake. I assume AWS is leaning more towards Hudi, since they made it available as a connector in the AWS Marketplace."

Can you talk about that from your perspective?

Roy Hasson | AWS

I'll let Vinoth talk about the differences between them from a technical perspective. But from a positioning perspective, our intention is to support all the formats popular with our customers. You know, Hudi works great with EMR; we made certain choices around including Hudi with Amazon EMR, and we support it in Glue. Delta Lake is also supported; it works fine with EMR, and you can make it work with Glue as well.

We continue to work with both the Hudi team and the Delta Lake team, and also the Iceberg team, to build more integration for these formats into our services, because customers are asking for them. So there's no one good answer. And, you know, of course we have our own that we announced, Lake Formation Governed Tables, to try to solve some of the complexities and some of the issues that exist today in the current formats.

It just gives customers options, but I’ll let Vinoth talk about the main differences.

Eric Kavanaugh | Bloor Group

Yeah, if you would Vinoth, go ahead.

Vinoth Chandar | Hudi

Yeah. So, Delta Lake versus Hudi. Technically, I'd like to keep it short; there are a lot of technical differences. For example, Delta Lake supports what in Hudi we call copy-on-write storage, where there's higher write amplification but you pretty much work in Parquet files. Whereas in Hudi you also get a more flexible merge-on-read kind of format, which lets you actually absorb updates and deletes and asynchronously compact them later. And then the transactional model for something like Delta Lake is strictly optimistic concurrency control, which, in my humble opinion, is not a great choice when you have long running transactions in the data lake ecosystem.
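
For reference, that copy-on-write versus merge-on-read trade-off surfaces in Hudi as a single table-type option chosen at write time; here is a hedged sketch with hypothetical names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi bundle is on the classpath
df = spark.createDataFrame([(42, "shipped", "2021-06-01 10:00:00")],
                           ["order_id", "status", "ts"])

(df.write.format("hudi")
    .option("hoodie.table.name", "orders")
    # MERGE_ON_READ logs updates and compacts asynchronously;
    # COPY_ON_WRITE instead rewrites Parquet files on every update.
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-lake/orders"))  # hypothetical table path
```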

And so Hudi was designed with more MVCC-based concurrency control, where, just like a database, we differentiate between actual external writers to the table and internal processes which are managing compaction and clustering. Just like how a database would coordinate between a cache manager, like a buffer pool manager, and a locking thread, Hudi has that runtime around it.

As a project, we have significant platform functionality that you get for free in open source; a good chunk of this functionality in Delta Lake is locked to the Databricks runtime. So with Hudi, you get all of this for free. You can run it on any cloud you want, on any Spark cluster that you want; even on Databricks, you can run all of this. That's why I would say Hudi, at this point, is a more complete platform, if you will. I also wanted to take this point to talk a little bit about table formats. We literally put up a blog today clarifying what the project stands for. In the last few years, there's been a lot of activity around table formats. In my opinion, at least, just building another format is a good step: it gets rid of a lot of bottlenecks in data access layers, like file listing, which can slow metadata access. But honestly, if these were solved at the metastore layer, formats wouldn't have to solve this within themselves.

So formats have their place, but I feel they're still a means to an end, in the grand scheme of things. If we were to have the same level of usability and reliability and ease of use for data lake users that they have in the [inaudible], we need a more well integrated stack on top. And that's what Hudi is building. For example, we've had a ticket to plug Iceberg in as an option under the Hudi runtime, if you will, for a while now. We are open to even working on top of other formats. Of course, users have to give up some benefits; like, Iceberg only supports optimistic concurrency control, so your compaction can conflict with and fail your ingestion, and these things may happen. So as long as you're okay with the trade-offs.

I think over time we should have more standard APIs across these formats, and actually also build more layers on top in a cohesive way. That's where I would push on the projects, and where they are pretty different in terms of where they are going.

Eric Kavanaugh | Bloor Group

That's fascinating stuff. I mean, really, it's very, very interesting. Dipti, what advice do you have for folks to stay on top of what's happening? Because I've been tracking open source now, on and off, for 15 years, and in the last three to four years it's borderline bewildering how much innovation is happening in different camps.

But in fact, one of the attendees asked an interesting question that you can maybe riff off of. The attendee is asking: what about all the other cloud environments, you know, Google and Microsoft? Are we going to see new walled gardens? Is this the new sort of Age of Empires, where instead of the old IBM, Oracle, SAP, now we have Google, Microsoft, Amazon? How much of that analogy holds, from your perspective?

Dipti Borkar | Ahana

Yeah, it's a good question, right? There were the three big database companies, and now that's changing. You still have data warehouses in each of these different clouds. But with these open source stacks, like Presto, which can plug into Hudi, and we plan to add the other table transaction managers as well, like Delta, etc., there's Hudi, which plugs into multiple engines on the top and multiple formats below it. The beauty of it is that every layer, every component in the stack fits in with multiple components on the top and multiple components below. And that's what you get when you take foundation-oriented open source projects.

Apache will always be open. With the Apache 2.0 license, you're never going to get locked in. Presto, being under the Linux Foundation, is always going to be open. It's a community driven project. Users have to think through what is important for them. Are they looking at just one solution that does everything, and may not do everything well, but does everything? That's kind of the traditional platform approach.

Databricks is trying to go with that approach, where it's trying to solve everything in this data lake space. Or you could pick the best engine and the best stack for the broad workloads that you're trying to solve. So, for example, for interactive querying, reporting and dashboarding, and federation on a lake, a Presto-Hudi stack makes more sense. Transformation? Spark might make more sense. For machine learning, TensorFlow might make more sense. So I would advise platform teams to think through: what is important for them? Performance characteristics? The open formats they want to support in the storage layer? And, of course, which cloud? Because even though all of these are multi-cloud technologies, and you can run them on every cloud, most teams have a primary cloud, and then they may have some secondary clouds. So figure out, for the cloud that you're on, what is the best option? What is the best stack?

And we're mostly focused on AWS; this is a great stack for AWS. For Google, it might be something else: they have Dataproc, these layers feed into Dataproc as well, and Presto can run on Dataproc as well. So those are some of the things I'd bring up, that platform teams should think through, Eric.

Eric Kavanaugh | Bloor Group

Okay, good. We have another good audience question I'll throw over to Vinoth. One attendee is asking: does Hudi fit in with the Cloudera stack? And how would you say it compares to or complements Apache Kudu?

Vinoth Chandar | Hudi

I'm not super familiar with all of the Cloudera stack, but at a high level? Yeah, it can run on top of HDFS, and you can query it from Impala even today. All our jobs will run on YARN. So I think it's compatible with the Cloudera stack. The Kudu question is very interesting. In fact, Kudu was something that we were evaluating at that time, before we wrote Hudi. The thing with Kudu is, it still feels like a specialized data store. I would lump Kudu more in the category of: it needs SSD-optimized storage, and it kind of gives you upserts which are always, like, closer to real time. I haven't tested it fully.

But my understanding from the paper is that you have SSD storage, so they can optimize both for updates as well as doing scans better. So it feels like a specialized storage engine for analytics; I'd evaluate it together with, let's say, Druid and the other real time analytics engines of the world and see how it fits. Hudi is designed more as, hey, this is the general purpose transactional storage layer for you, where you can manage all of your data, right, forever. And this was one of the reasons why we decided to write Hudi: because we couldn't really see how, at least at the data volume Uber had, I could future proof this. Meaning, what if we were to move to a cloud object storage in the future? Uber has 250 petabytes stored like this today, or something like that.

I don't want to run a Kudu cluster that big, or any other cluster that big. So again, going back, this decoupled layer is awesome, because you can have a team just scaling and managing the data plane, and then there are separate teams who can bring in Presto and work on Presto. This whole model was, I think, much more scalable. I thought so back in the day, and honestly, when I was trying to do it, I was alone. But over time, I've seen Delta Lake do a similar thing, which is, hey, we're going to decouple a transaction layer written in Spark. That's kind of what we had already done before. So yeah, I think this model scales a lot more for a general purpose data layer, I would say.

Eric Kavanaugh | Bloor Group

That's great. Well, folks, we've burned through an hour here. I want to throw one last question at each of you, to kind of summarize and be a bit forward looking perhaps. Roy, I'll start with you. I had this realization a few months ago, I guess I've been thinking about it for a while, that we really are entering, we're now in, a new generation of enterprise computing. And I describe it as four generations, basically.

First is the mainframe. Second would be client-server. Third is the original ASP generation, application service providers, where, you know, Salesforce was kind of the big poster child. And now what we're seeing is this fourth generation, where you don't want to go build some stack yourself to do all these different component parts. You want to let companies like Amazon, quite frankly, and even Uber and Google and these other major vendors, because they open source this stuff, build those services. And then you leverage those services.

So you build a sort of vision or a framework on top of these existing services that Amazon and others can provide. And that's a whole new way of doing things. What do you think about that assessment, Roy?

Roy Hasson | AWS

Yeah, no, I think that's generally true. You know, AWS is building these building blocks so folks like Dipti and Vinoth can build and innovate on top of them. We're talking to our customers all the time, trying to find the simplest and best way to solve their problems. Sometimes we can do it with our native services; sometimes we just create these building blocks, and then partners like Ahana come in, put it all together, put a nice bow around it, and say this is the best way to do it. So that's a great ecosystem.

And I think, as we continue down this path of analytics, it really boils down to exposing those APIs and integrating. Building these vertical solutions that claim to do everything is going back 20-30 years to the same problems we had with Oracle. Do we really want to do it now, just in a shinier package? Not picking on anyone in particular, I'm just saying, we tend to over-bias on the ease of use and what the business needs to make it easier, and we tend to forget, when we're building these solutions, that they're not there for a year; they're there for 10-15 years. So we have to make sure that we're also looking at the architecture, the implementation, the decoupling, because the scale problem will come back. Maybe we just kind of pushed it down the road a little bit. But the scale problem will come back, and if you don't have the right levers, if you don't have the right technology to solve those problems, you're back to square one just a few years down the road.

So again, with Hudi, I think it's a great way to store the data. It's a great way to simplify managing the data. But it also gives you portability. If tomorrow you said, hey, I want to go to a different vendor, or I want to just move away from Hudi to something else, it is an open format; you can just take your data and do whatever you want with it. It doesn't lock you in. The same thing with the Glue Catalog. You know, if you want to use our service, fantastic.

If you don't want to use it, if you think something else is better, the metadata is there; there are APIs. And I'll say the same thing about Ahana and Athena and all these other services. We're using SQL. If you decide that you want to take your SQL somewhere else, because there's a better, faster, cheaper engine, go ahead and do it. So this decoupling makes a lot of sense, and it's going to save customers a ton of money and effort in the long run.

Eric Kavanaugh | Bloor Group

That's a great point. Dipti, closing thoughts from you?

Dipti Borkar | Ahana

Yeah, completely agree. Obviously, we've built a managed service on top of AWS, which is an incredible cloud; it gives you really all those building blocks. For the control plane, we've used so many different aspects: Presto runs on Kubernetes, we're using EKS, and we have serverless Lambdas that are part of the control plane, so that it's highly scalable. These building blocks allow vendors like Ahana to build newly designed, new world control planes that take out some of this operational complexity and make it very easy to use, while giving users the flexibility to extend and scale at these different layers. So it's really three or four different components, that's what it comes down to, and so it's fairly manageable.

Given each layer is a managed service, it's up and running in 30 minutes. In 30 minutes, you can do SQL on S3, integrating all of this stack. That is the new world. And whether it's AWS or other clouds, the ease of use and the flexibility of the openness is what it's about. The way I define open data lakes is: open formats, ORC or Parquet, very important. You can move an engine, move a transaction manager, whenever you need to; you're not locked in, as opposed to ingesting your data into some other data warehouse in another place.

Open interfaces: SQL is the lingua franca, there are a lot of tools out there, and it is open. Open source: you're not getting locked into a proprietary engine in these layers on the top. Storage is S3, which has been commoditized over 15 years now, so that's taken care of. And then open clouds. We've talked about AWS largely today, but, you know, it is a multi-cloud world, and you can run this stack on different clouds as well.

Eric Kavanaugh | Bloor Group

I love it, I want to give Vinoth one chance to give his closing thoughts as well. What does the future hold for you? And for Hudi?

Vinoth Chandar | Hudi

Yeah, so I think we will continue to build out this data plane, an open data plane, and make it as easy to build on as possible. Then, reflecting on some of what Roy mentioned, I want to be a little bit backward looking: if you look at a lot of the technologies that we use today, Hive or Parquet or some of these things, these were actually born in the Hadoop era, if you will. But that's the beauty of this model: these things are in the open, and when vendors don't do a great job, others can step up. People can operate managed EMR, or Presto as a service. Then we have this opportunity to keep building towards this vision over a long period of time. Whereas if we keep going down the path of the closed walled garden being the way of the future, then we know what happens, right? Big companies saturate after a point, innovation slows down, and then people move out.

So all of this will inevitably happen, and what will end up happening is that innovation across the board will be affected by it. So our goal here is to keep making this data plane better and better, so that you can painlessly get started with bringing your data in, and then use anything that you want. That's the principle, and I think we have our work cut out for the next one to two years in terms of matching all the walled gardens on usability. I say this while realizing you cannot win on openness alone; that's why we emphasize a lot the well integrated stack inside the Hudi project, and we'll continue to do so.

Eric Kavanaugh | Bloor Group

I love it. Well, folks, look all these guys up online. This is the new stack. It's another way of looking at how to build out your future. It's always changing. But I think these folks understand the critical importance of interoperability, and componentization, if you will. So, we do archive all these webinars. Thanks for joining us. Thanks, Dipti, for the invite. Thank you, gentlemen, for joining us today. Great questions from the audience. I'm sure we're going to have much more to talk about over the next few years. Thank you.

Dipti Borkar | Ahana

Great, thanks, everyone. Bye.

Speakers

Vinoth Chandar
Creator of Hudi


Roy Hasson
Principal Product Manager


Dipti Borkar
Cofounder & CPO


Eric Kavanagh
Moderator
