Presto offers several classes of mathematical functions that operate on single values and mathematical operators that allow for operations on values across columns. In addition, aggregate functions can operator on a set of values to compute a single result.
The mathematical functions are broken into four subcategories: 1. mathematical, 2. statistical, 3. trigonometric, and 4. floating point. The majority fall into the mathematical category and we’ll discuss them separately. The statistical functions are quite sparse with two functions that compute the lower and upper bound of the Wilson score interval of a Bernoulli process. The trigonometric functions are what you’d expect (e.g.
tan, etc.). The floating point functions are really functions that handle not-a-number and infinite use cases.
The mathematical functions subcategory further fall into another layer of classification:
- Functions that perform coarser approximation, such as rounding and truncation:
- Exponents, logarithms, roots:
- Convenient constants, such as
- Cumulative distribution functions (and inverses):
The mathematical operators are basic arithmetic operators, such as addition (
+), subtraction (
-), multiplication (
*), and modulus (
Let’s apply these mathematical functions in an example. In the following query, have a floating-point column
x to which we apply several mathematical functions that are representative of the subcategories we discussed previously, including: radians (conversion), natural log, the Normal CDF, modulo, random number, and operators.
select x, radians(x) as radians_x, /* convert to radians */ ln(x) as ln_x, /* natural log */ normal_cdf(0,30,x) as_normal_cdf_x, /* Normal CDF */ mod(x,2) as mod_x_2, /* Modulo 2 */ random() as r, /* Random number */ 3*((x/2)+2) as formula /* Formula using operators */ from example;
The following is the output the above query with some rounding for ease of viewing.
So, far we see that mathematical functions, as they are classified in Presto, operate on single values. What this means is that given a column of values, each function is applied element-wise to that column. Aggregate functions allow us to look across a set of values.
Like mathematical functions, aggregate functions are also broken into subcategories: 1. general, 2. bitwise, 3. map, 4. approximate, 5. statistical, 6. classification metrics, and 7. differential entropy. We will discuss the general and approximate subcategory separately.
The bitwise aggregate functions are two functions that return the bitwise
AND and bitwise
OR or all input values in 2’s complement representation. The map aggregate functions provide convenient map creation functions from input values. The statistical aggregate functions are standard summary statistic functions you would expect, such as
skewness. The classification metrics and differential entropy aggregate functions are specialized functions that make it easy to analyze binary classification predictive modelling and model binary differential entropy, respectively.
The general functions subcategory further fall into another layer of classification:
- Common summarizations:
- Boolean tests:
- Data structure consolidation:
Again, let’s apply these aggregate functions in a series of representative examples. In the following query, we apply a series of basic aggregations to our floating-point column
select sum(x) as sum_x, count(x) as count_x, min(x) as min_x, max(x) as max_x, avg(x) as avg_x, checksum(x) as ckh_x from example;
The following is the output the above query.
In the following query, we showcase a boolean test with the
bool_or function. We know that the natural log will return a
NaN for negative values of
x. So, if we apply the
is_nan check, we expect
x to always be
false, but for our
ln result to occasionally be
true. Finally, if we were to do the
bool_or aggregation on our
is_nan functions, we expect the column derived from
x to be
false (i.e. no
true at all) and the column derived fro
ln(x) to be
true (i.e. at least one true value). The following query and accompanying result illustrate this.
with nan_test as ( select is_nan(x) as is_nan_x, is_nan(ln(x)) as is_nan_ln_x from example ) select bool_or(is_nan_x) as any_nan_x_true, bool_or(is_nan_ln_x) as any_nan_ln_x_true from nan_test;
This final example illustrates the use of an example of data consolidation, taking a
radians(x) columns and creating a single row with a map data structure.
with rad as(select x, radians(x) as rad_x from example) select map_agg(x, rad_x) from rad;
The approximate aggregate functions provide approximate results for aggregate large data sets, such as distinct values (
approx_distinct), percentiles (
approx_percentile), and histograms (
numeric_histogram). In fact, we have a short answer post on how to use the
approx_percentile function. Several of the approximate aggregate functions rely on other functions and data structures: quantile digest, HyperLogLog and KHyperLogLog.
A natural extension to aggregate functions are window functions, which perform calculations across rows of a query results. In fact, all aggregate functions can be used as window functions by adding an
OVER clause. One popular application of window functions is time-series analysis. In particular, the
lag function window function is quite useful. We have a short answer post on how to use the
lag window function and to compute differences in dates using the
lag window function.
This was short article was a high-level overview, and you are encouraged to review the Presto public documentation for Mathematical Functions and Operations and Aggregate Functions. If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.