Q&A from the Panel
How’s Presto used at your company?
Uber: Presto is very widely used. It powers analytics on our data lake, and at last count I think about half the company logs into Presto at least once a month to do something that’s important for their work. The power of this system comes out in some of the use cases we have at Uber. For instance, we have Pinot as a real-time analytics tool, and we have Hadoop as our big data lake where we store our historical data. Through Apache Superset, for instance, we are able to build a single dashboard that lets our users look at how the real-time business is doing as well as what the historical trends are, in a single place and through a single interface, all powered by Presto. This highlights the power that Presto brings to data analytics.
Alibaba: Alibaba is the largest cloud service provider in China and in the Asia market. From the very beginning of Alibaba Cloud, we have embraced and contributed back to the open source community to build up a lot of our cloud services, for example databases and big data analytics. As for Presto, we leverage it in several of our key cloud products, for example our Data Lake Analytics, which provides serverless federated analytics. We’re in the Big Data era, right? But now we think we are moving to the Fast Data era, which requires a lot more effort to run analytics across different data sources in real time.
Why is Presto so important and how does it help today’s user?
Bloor Group: The concept of a data lake came around during the big data evolution. What you see now is companies that have a data warehouse or two or five, a data lake or two or three, and lots of different information systems that they’re looking to put to use. But it no longer makes sense to centralize everything: you can’t pull all your data into a warehouse, and you can’t even realistically pull all your data into a data lake and then expect good performance. So what’s been happening is that Presto, as a federated query engine, allows you to access all these different information sources where they are and get that truly strategic view of what’s going on in your business.
Ahana: Data engineers spend a lot of time building pipelines to copy data from one source to another and make it accessible there, whether that’s from their operational system to S3, to a data warehouse, etc. 60% of their time is spent moving data around. Presto solves this problem beautifully by querying data in place: it pushes some of the core query logic down to the data sources (we call these pushdowns), pulls the resulting data into memory, processes it, and returns it to the analyst, the technical product manager, or the data-driven person at the other end trying to make a decision.
In-place querying, with the ability to federate across many data sources and join across them, is a very hard problem to solve, and Presto solves it very well because it’s based on ANSI SQL. That’s also one of the reasons it has become so popular: you don’t have to relearn anything or re-integrate existing products, it already comes with standard JDBC/ODBC drivers, and it just works out of the box. In terms of the architecture, you get access to all these data sources, and the data is processed in memory, as opposed to some of the previous-generation technologies that were more disk based. That’s the other important aspect: it’s an in-memory system with the ability to spill to disk for larger data sets. And the speed is important.
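The pushdown idea described above can be illustrated with a toy sketch in Python. This is not Presto’s connector API: `ToySource`, `scan`, and the sample `orders` data are hypothetical names invented for illustration. The sketch only shows why filtering at the source means fewer rows have to be moved into the engine’s memory.

```python
# Hypothetical toy data source; NOT a Presto connector interface.
class ToySource:
    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0  # counts rows that leave the source

    def scan(self, predicate=None):
        """Return rows; if a predicate is pushed down, filter before sending."""
        out = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_sent += len(out)
        return out


# Made-up sample data: every fourth order is in the EU region.
orders = [{"id": i, "region": "EU" if i % 4 == 0 else "US"} for i in range(1000)]


def is_eu(row):
    return row["region"] == "EU"


# Without pushdown: the engine pulls every row, then filters in memory.
no_pushdown = ToySource(orders)
result_a = [r for r in no_pushdown.scan() if is_eu(r)]

# With pushdown: the filter runs at the source, so fewer rows are transferred.
with_pushdown = ToySource(orders)
result_b = with_pushdown.scan(predicate=is_eu)

print(no_pushdown.rows_sent, with_pushdown.rows_sent)
```

Both paths return the same rows; the difference is how much data crosses the boundary between the source and the engine.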
Uber: Presto abstracts the underlying data systems away from the user interfaces that analysts and data scientists use to understand the data, tools like Superset, Jupyter, Tableau, etc. It gives our users a single point of contact for querying data on all these systems from those front-end interfaces. But even more important than a single point of contact, I think, is a single abstraction layer. By that I mean you can query data on all these data sources using a single query language, which is Presto’s SQL, and a single view of the data, which is basically a set of definitions within Presto. You can take data from, let’s say, Hadoop HDFS and Pinot and join them as if they were sitting in a single database. Presto makes all of this possible through the abstraction layers that it builds.
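To make the "single database" idea concrete, here is a minimal runnable sketch in Python. It does not talk to Presto: two in-memory SQLite databases attached under the names `hive` and `pinot` stand in for two Presto catalogs, and the table names and rows are invented for illustration. In Presto itself, tables are addressed as catalog.schema.table, but the shape of the join is the same: one SQL statement spanning two separately stored data sets.

```python
import sqlite3

# Stand-in only: two attached SQLite databases play the role of two catalogs.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS hive")   # historical data (assumed)
conn.execute("ATTACH DATABASE ':memory:' AS pinot")  # real-time data (assumed)

# Hypothetical tables and rows, invented for this sketch.
conn.execute("CREATE TABLE hive.trips_daily (city TEXT, avg_trips INTEGER)")
conn.execute("CREATE TABLE pinot.trips_live (city TEXT, trips_today INTEGER)")
conn.executemany(
    "INSERT INTO hive.trips_daily VALUES (?, ?)", [("SF", 100), ("NYC", 200)]
)
conn.executemany(
    "INSERT INTO pinot.trips_live VALUES (?, ?)", [("SF", 120), ("NYC", 150)]
)

# One query joins both sources as if they lived in a single database.
FEDERATED_QUERY = """
SELECT h.city, h.avg_trips, p.trips_today
FROM hive.trips_daily AS h
JOIN pinot.trips_live AS p ON p.city = h.city
ORDER BY h.city
"""
rows = conn.execute(FEDERATED_QUERY).fetchall()
for row in rows:
    print(row)
```

A dashboard tool would issue only the single query; where each table physically lives is hidden behind the catalog names.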
How can a developer get involved in the Presto project and community?
Uber: Through the website and the GitHub repository. You can download the code, it’s free; you can take a look at the code, it’s open source. And you can make modifications and contributions by submitting a pull request on GitHub: you put your changes in, get them reviewed by a committer, and make that contribution. In the next release your changes are there, and they’re available for you and everybody else to use. The GitHub repository and the community process are the primary means by which the vast majority of developers can engage.
Now, if you want to get more involved, for instance in the technical direction, maybe because you have opinions or use cases that require deeper changes in the engine itself, get involved with the technical steering committee; the meetings are open to the public. As you build your engagement with the community and your contributions to the code, you can even earn committership, where you’re recognized as an engineering leader in the community. You can also become part of the technical steering committee and help drive the direction, if that is something you want to engage with on a more full-time basis. In addition to these two venues, we also organize a lot of community events, seminars, and talks, which are great ways to come and interact directly with other folks in the community.