Advanced SQL Tutorial

JSON

Presto has a wide-range of JSON functions. Consider this json test input data (represented in the query using the  VALUES function) which contains 3 key/value elements. The key is “name” and the value is a dog breed. If we want to select the  the first (0th) key/value pair we would code:

SELECT json_extract(v, '$.dogs) AS all_json, 
  json_extract(v, '$.dogs[0].name') AS name_json, 
  json_extract_scalar(v, '$.dogs[0].name') AS name_scalar 
FROM 
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);
 
                         all_json                         | name_json | name_scalar 
----------------------------------------------------------+-----------+-------------
 [{"name":"Beagle"},{"name":"Collie"},{"name":"Terrier"}] | "Beagle"  | Beagle      
(1 row)

All of Presto’s JSON functions can be found at: https://prestodb.io/docs/current/functions/json.html 

Arrays, Unnesting, and Lambda functions 

Consider the following array of test data elements, and simple query to multiple each element by 2:

SELECT elements,
    ARRAY(SELECT v * 2
          FROM UNNEST(elements) AS v) AS my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);
 
    elements    | my_result
----------------+---------------------
 [1, 2]         | [2, 4]
 [1, 3, 9]      | [2, 6, 18]
 [1, 4, 16, 64] | [2, 8, 32, 128]
(3 rows)

The above query is an example of nested relational algebra which provides an fairly elegant and unified way to query and manipulate nested data. 

Now here’s the same query, but written using a lambda expression. Why use lambda expressions?  This method makes querying nested data less complex and the code simpler to read/develop/debug, especially when logic gets more complicated:

SELECT elements, 
transform(elements, v -> v * 2) as my_result
FROM (
    VALUES
        (ARRAY[1, 2]),
        (ARRAY[1, 3, 9]),
        (ARRAY[1, 4, 16, 64])
) AS t(elements);

Both queries return the same result. The transform function and “x -> y” notation simply means  “do y to my variable x”.

To see more lambda expression examples check out: https://prestodb.io/docs/current/functions/lambda.html 

Counting Distinct Values

Running a count(distinct xxx) function is memory intensive and can be slow to execute on larger data sets. This is true for most databases and query engines.  The Presto-cli will even display a warning reminding you of this.  

A useful alternative is to use the approx_distinct function which uses a different algorithm (the HyperLogLog algorithm) to estimate the number of distinct values.  The result is an approximation and the margin of error depends on the cardinality of the data. The approx_distinct function should produce a standard error of up to 2.3% (but it could be higher with unusual data). 
Here’s an example comparing distinct and approx_distinct with a table containing 160.7 million rows.  Data is stored in S3 as Parquet files and the Presto cluster has 4 workers. We can see approx_distinct is more than twice as fast as count(distinct xxx):

presto:amazon> select count(distinct product_id) from review;
 
  _col0   
----------
 21460962 
(1 row)
 
WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
 
 
Query 20201231_154449_00058_npjtk, FINISHED, 4 nodes
Splits: 775 total, 775 done (100.00%)
0:56 [161M rows, 1.02GB] [2.85M rows/s, 18.4MB/s]
 
presto:amazon> select approx_distinct(product_id) from review;
  _col0   
----------
 21567368 
(1 row)
 
Query 20201231_154622_00059_npjtk, FINISHED, 4 nodes
Splits: 647 total, 647 done (100.00%)
0:23 [161M rows, 1.02GB] [7.01M rows/s, 45.4MB/s]