SQL Server Big Data

A SQL server database engine can be used to access and analyze big data. A SQL server includes both a storage engine and a query processor, also known as a query engine. Interactive, ad hoc analytics on big data can be run through a SQL query engine such as Presto, which is open source and federated. Presto lets you query data where it is stored, including Hive, AWS S3, Hadoop, Cassandra, relational databases, NoSQL databases, and even proprietary data stores. Because Presto can reach all of these sources, you can run analytics across an entire organization.

Presto is an extra compute layer for faster analytics. Because it doesn’t store the data, its compute resources can be scaled up or down with demand. Separating compute and storage makes the Presto query engine a good fit for cloud environments. You can run Presto in AWS, Azure, and Google Cloud environments.

There are two kinds of Presto servers: coordinators and workers. The coordinator receives the SQL statements to execute, parses them, plans the queries, and manages the workers. It decides which workers execute which tasks and process the data. Workers let the coordinator know when they are available. Once the workers finish their tasks, the coordinator gathers their results and sends the final result to the client.
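The division of labor described above can be sketched in a few lines of Python. This is a toy model, not Presto code: the SPLITS data, the worker_task function, and the coordinator function are all hypothetical names, and a thread pool stands in for a cluster of worker servers. It only illustrates the pattern of handing out chunks of work, computing partial results in parallel, and merging them into one final result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "split" is a chunk of (country, amount) rows
# that one worker can scan independently of the others.
SPLITS = [
    [("us", 3), ("de", 1)],
    [("us", 2), ("fr", 4)],
    [("de", 5), ("fr", 1)],
]


def worker_task(split):
    """Worker side: partial SUM(amount) GROUP BY country for one split."""
    partial = {}
    for country, amount in split:
        partial[country] = partial.get(country, 0) + amount
    return partial


def coordinator(splits, num_workers=2):
    """Coordinator side: hand splits to workers, then merge partial results."""
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partials = list(workers.map(worker_task, splits))

    final = {}
    for partial in partials:
        for country, total in partial.items():
            final[country] = final.get(country, 0) + total
    return final


result = coordinator(SPLITS)
print(sorted(result.items()))
```

In this sketch the merge step runs only after every worker has returned, which mirrors the description of the coordinator gathering results before answering the client.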

Presto is a high-performance, distributed SQL query engine for big data.

What it isn’t: Although the official GitHub repo for the open source project, which operates under the auspices of the Presto Foundation, is called PrestoDB, Presto is not a database. Users can’t store data in Presto, and it can’t replace relational databases such as PostgreSQL, Oracle, or MySQL. Presto stores intermediate data in its buffer cache, but it’s not meant to be used as a persistent storage layer. Further, Presto is not designed to handle online transaction processing (OLTP).

What it is: Presto is an open source, distributed SQL query engine that’s best used for running interactive analytic queries on data sources of all sizes. Presto enables users to query data where it lives, whether that is Hive, Cassandra, relational databases, or proprietary data stores. Users can combine data from multiple sources in a single SQL query, making it possible to analyze data across the entire organization.
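To make the idea of one query spanning several sources concrete, here is a small runnable analogy. It does not use Presto: SQLite from the Python standard library stands in for the query engine, and two separately attached in-memory databases stand in for two independent data sources. The table names, columns, and sample rows are invented for illustration. In Presto itself, tables are addressed as catalog.schema.table, so a comparable join might reference names such as hive.sales.orders and mysql.crm.customers (both hypothetical).

```python
import sqlite3

# Assumption: SQLite plays the role of the query engine, and each attached
# database plays the role of a separate data source.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# Source 1: order facts. Source 2: customer records. (Hypothetical data.)
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 7.5)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# A single SQL statement joins rows that live in two different sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

The point of the analogy is the shape of the query: the analyst writes ordinary SQL, and the qualified table names decide which source each row comes from.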

The PrestoDB project is community owned and driven and is supported by the Presto Foundation, which is an independent, nonprofit organization that’s hosted under the Linux Foundation. Founding members of the Presto Foundation include Facebook, Alibaba, Twitter and Uber. These members are helping drive the future direction of the project, with the goal of making Presto the most reliable SQL engine for massively distributed data processing.

Presto is particularly well-suited for platform teams who want to provide self-service analytics to their teams. It supports a broad range of use cases including ad hoc querying with SQL at any time, wherever the data resides; data lake analytics for querying directly on a data lake without the need for transformation; and federated querying, i.e. the ability to query data across multiple sources such as databases, data lakes, and more.

A full Presto deployment includes a coordinator and multiple workers. Analysts submit their queries to the coordinator through a client such as the Presto CLI. The coordinator parses, analyzes, and plans the query execution, then distributes the processing to the workers.
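For readers curious about what a client does when it talks to the coordinator, the following is a minimal sketch of that exchange over Presto's HTTP interface, written with only the Python standard library. It assumes the usual flow: the client posts the SQL text to the coordinator's /v1/statement endpoint with an X-Presto-User header, then follows each nextUri in the JSON responses until none is returned. The function names, the default user name, and the coordinator address in the comment are hypothetical, and a real client would also handle retries, authentication, and cancellation.

```python
import json
from urllib import request


def http_json(method, url, body=None, user="analyst"):
    """Send one HTTP request and decode the JSON response."""
    data = body.encode("utf-8") if body is not None else None
    req = request.Request(
        url, data=data, method=method, headers={"X-Presto-User": user}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def run_query(sql, coordinator_url, send=http_json):
    """Submit SQL to the coordinator and collect rows page by page.

    `send` is injectable so the paging logic can be exercised without
    a running cluster.
    """
    response = send("POST", coordinator_url + "/v1/statement", sql)
    rows = []
    while True:
        if "error" in response:
            raise RuntimeError(response["error"].get("message", "query failed"))
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:
            return rows
        response = send("GET", next_uri)


# Example call, assuming a coordinator is reachable at this address:
# rows = run_query("SELECT 1", "http://localhost:8080")
```

Keeping the transport separate from the paging loop is a design choice for this sketch only; it lets the loop be tested against canned responses.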