Advanced SQL Tutorial
Advanced SQL: JSON

Presto has a wide-range of JSON functions supporting advanced SQL queries. Consider this Json test input data (represented in the query using the VALUES function) which contains 3 key/value elements. The key is “name” and the value is a dog breed. If we want to select the the first (0th) key/value pair we would code:
SELECT json_extract(v, '$.dogs) AS all_json,
json_extract(v, '$.dogs[0].name') AS name_json,
json_extract_scalar(v, '$.dogs[0].name') AS name_scalar
FROM
(VALUES JSON ' {"dogs": [{"name": "Beagle"}, {"name": "Collie"}, {"name": "Terrier"}]} ') AS t (v);
all_json | name_json | name_scalar
----------------------------------------------------------+-----------+-------------
[{"name":"Beagle"},{"name":"Collie"},{"name":"Terrier"}] | "Beagle" | Beagle
(1 row)
All of Presto’s JSON functions can be found at: https://prestodb.io/docs/current/functions/json.html
Advanced SQL: Arrays, Un-nesting, and Lambda functions
Consider the following array of test data elements, and simple query to multiple each element by 2:
SELECT elements,
ARRAY(SELECT v * 2
FROM UNNEST(elements) AS v) AS my_result
FROM (
VALUES
(ARRAY[1, 2]),
(ARRAY[1, 3, 9]),
(ARRAY[1, 4, 16, 64])
) AS t(elements);
elements | my_result
----------------+---------------------
[1, 2] | [2, 4]
[1, 3, 9] | [2, 6, 18]
[1, 4, 16, 64] | [2, 8, 32, 128]
(3 rows)
The above advanced SQL query is an example of nested relational algebra which provides an fairly elegant and unified way to query and manipulate nested data.
Now here’s the same query, but written using a lambda expression. Why use lambda expressions? This method makes advanced SQL querying nested data less complex and the code simpler to read/develop/debug, especially when logic gets more complicated:
SELECT elements,
transform(elements, v -> v * 2) as my_result
FROM (
VALUES
(ARRAY[1, 2]),
(ARRAY[1, 3, 9]),
(ARRAY[1, 4, 16, 64])
) AS t(elements);
Both queries return the same result. The transform function and “x -> y” notation simply means “do y to my variable x”.
To see more lambda expression examples check out: https://prestodb.io/docs/current/functions/lambda.html
Advanced SQL: Counting Distinct Values
Running a count(distinct xxx) function is memory intensive and can be slow to execute on larger data sets. This is true for most databases and query engines. The Presto-cli will even display a warning reminding you of this.
A useful alternative is to use the approx_distinct function which uses a different algorithm (the HyperLogLog algorithm) to estimate the number of distinct values. The result is an approximation and the margin of error depends on the cardinality of the data. The approx_distinct function should produce a standard error of up to 2.3% (but it could be higher with unusual data).
Here’s an example comparing distinct and approx_distinct with a table containing 160.7 million rows. Data is stored in S3 as Parquet files and the Presto cluster has 4 workers. We can see approx_distinct is more than twice as fast as count(distinct xxx):
presto:amazon> select count(distinct product_id) from review;
_col0
----------
21460962
(1 row)
WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
Query 20201231_154449_00058_npjtk, FINISHED, 4 nodes
Splits: 775 total, 775 done (100.00%)
0:56 [161M rows, 1.02GB] [2.85M rows/s, 18.4MB/s]
presto:amazon> select approx_distinct(product_id) from review;
_col0
----------
21567368
(1 row)
Query 20201231_154622_00059_npjtk, FINISHED, 4 nodes
Splits: 647 total, 647 done (100.00%)
0:23 [161M rows, 1.02GB] [7.01M rows/s, 45.4MB/s]
Related Articles
A Comprehensive Guide to Data Warehouse Types
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. Learn more about what these data warehouse types are and the benefits they provide to data analytics teams within organizations..
Presto vs Snowflake: Data Warehousing Comparisons
Presto is an open-source SQL query engine, developed by Facebook, for large-scale data lakehouse analytics. Snowflake is a cloud data warehouse that offers a cloud-based information storage and analytics service. Learn more about the differences between Presto and Snowflake in this article.