How do I transfer data from a Hadoop / Hive cluster to a Presto cluster?

Hadoop is a system that manages both compute and data together. Hadoop cluster nodes run the HDFS file system and may also run query engines such as Apache Hive or Impala, either on the same nodes or on different ones.

In comparison, Presto, an open source SQL engine built for data lakes, is only a query engine. It does not manage its own data. Instead, it queries data where it already lives, such as HDFS or cloud data lakes like Amazon S3.

Because of this, no data transfer or ingestion into Presto is required for data that already resides in an HDFS cluster. Presto’s Hive connector was designed specifically to access data in HDFS and make it queryable from Presto. The connector must be configured with the right set of configuration properties.

The Presto Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).

Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://example.net:9083
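If the catalog file is produced by a provisioning script rather than written by hand, the two properties above could be rendered along these lines. This is only a minimal sketch: the function name render_hive_catalog is hypothetical, and the only values taken from the text above are connector.name, hive.metastore.uri, and the port 9083 used in the example URI.

```python
def render_hive_catalog(metastore_host: str, metastore_port: int = 9083) -> str:
    """Return the text of etc/catalog/hive.properties for a Thrift metastore.

    Hypothetical helper: it only formats the two properties shown in the
    article and does not contact the metastore.
    """
    if not metastore_host or any(ch.isspace() for ch in metastore_host):
        raise ValueError("metastore_host must be a non-empty name without whitespace")
    if not 1 <= metastore_port <= 65535:
        raise ValueError("metastore_port must be between 1 and 65535")

    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri=thrift://{metastore_host}:{metastore_port}",
    ]
    # Properties files are line oriented, so end with a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Replace example.net with the host of your Hive metastore Thrift service.
    print(render_hive_catalog("example.net"), end="")
```

The output would then be saved as etc/catalog/hive.properties on each Presto node.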

For basic setups, Presto configures the HDFS client automatically and does not require any configuration files. In some cases, such as when using federated HDFS or NameNode high availability, it is necessary to specify additional HDFS client options in order to access your HDFS cluster. To do so, add the hive.config.resources property to reference your HDFS config files:

hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
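A mistyped path in hive.config.resources is easy to miss, so it can help to check the files before writing the property. The sketch below is one possible way to do that, assuming the script runs on a node where the HDFS config files are expected to be readable. The helper name config_resources_line and its check_exists flag are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def config_resources_line(paths: Iterable[str], check_exists: bool = True) -> str:
    """Build the hive.config.resources line from a list of HDFS config files.

    Hypothetical helper: with check_exists=True it verifies that every path
    is an existing file on the local machine before formatting the line.
    """
    resolved = [str(p) for p in paths]
    if not resolved:
        raise ValueError("at least one HDFS config file is required")

    if check_exists:
        missing = [p for p in resolved if not Path(p).is_file()]
        if missing:
            raise FileNotFoundError("HDFS config files not found: " + ", ".join(missing))

    # The property value is a comma-separated list of file paths.
    return "hive.config.resources=" + ",".join(resolved)


if __name__ == "__main__":
    print(config_resources_line(
        ["/etc/hadoop/conf/core-site.xml", "/etc/hadoop/conf/hdfs-site.xml"],
        check_exists=False,  # skip the file check for this demonstration
    ))
```

The resulting line is appended to the same hive.properties catalog file that holds connector.name and hive.metastore.uri.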