Using Presto with a Hadoop cluster for SQL analytics is common, especially in on-premises deployments.
With Presto, you can read and query data from the Hadoop datanodes, and you can also modify data in HDFS, though with some restrictions.
All this is enabled via Presto’s Hive Connector.
The first step is to create a catalog properties file and point to the Hive Metastore.
You can optionally configure additional Hive Metastore properties for the Hive Connector.
Create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
For basic setups, Presto configures the HDFS client automatically and does not require any configuration files.
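If your HDFS setup needs more than the defaults (for example, HA namenodes or federation), you can point the connector at your Hadoop configuration files via the hive.config.resources property. A minimal sketch, assuming the files live under /etc/hadoop/conf:

hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml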
Creating a new table in HDFS via Hive
Using Presto, you can create new tables via the Hive Metastore.
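The target schema must already exist in the metastore. If it does not, you can create it from Presto first; a minimal sketch, where the HDFS location is a hypothetical path shown for illustration:

CREATE SCHEMA hive.web
WITH (location = 'hdfs://hadoop-master:9000/user/hive/warehouse/web.db')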
Examples
Create table:
Create a new Hive table named page_views in the web schema that is stored using the ORC file format, partitioned by date and country. Here, hive is the catalog name, which comes from the name of the properties file (hive.properties).
CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country']
)
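Once the table exists, you can load data into it with a standard INSERT; Presto writes the files to HDFS and registers the partitions in the metastore. A quick sketch with made-up values (note that the partition columns ds and country come last, matching the table definition):

INSERT INTO hive.web.page_views
VALUES (TIMESTAMP '2016-08-09 10:00:00', 42, 'https://example.com/', DATE '2016-08-09', 'US')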
Deleting data from Hive / Hadoop
With the Hive connector, you can delete data, but only at the granularity of entire partitions.
Example:
Drop a partition from the page_views table:
DELETE FROM hive.web.page_views
WHERE ds = DATE '2016-08-09'
AND country = 'US'
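To check which partitions exist, for example before running a delete, the Hive connector exposes a hidden $partitions table for each partitioned table; a sketch:

SELECT * FROM hive.web."page_views$partitions"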
Drop the external table request_logs. This only drops the metadata for the table. The referenced data directory is not deleted:
DROP TABLE hive.web.request_logs
Drop a schema:
DROP SCHEMA hive.web
Hive Connector Limitations
- DELETE is only supported if the WHERE clause matches entire partitions (see the example below).
- UPDATE is not supported by Presto.
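For example, a DELETE keyed on a non-partition column is rejected, because it cannot be mapped to whole partitions (hypothetical query for illustration):

-- Fails: user_id is not a partition column
DELETE FROM hive.web.page_views
WHERE user_id = 42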