How to use mathematical functions and operators and aggregate functions for Presto?

Presto offers several classes of mathematical functions that operate on single values and mathematical operators that allow for operations on values across columns. In addition, aggregate functions can operator on a set of values to compute a single result.

The mathematical functions are broken into four subcategories: 1. mathematical, 2. statistical, 3. trigonometric, and 4. floating point. The majority fall into the mathematical category and we’ll discuss them separately. The statistical functions are quite sparse with two functions that compute the lower and upper bound of the Wilson score interval of a Bernoulli process. The trigonometric functions are what you’d expect (e.g. sin, cos, tan, etc.). The floating point functions are really functions that handle not-a-number and infinite use cases.

The mathematical functions subcategory further fall into another layer of classification:

  1. Functions that perform coarser approximation, such as rounding and truncation: abs, ceiling (ceil), floor, round, sign, truncate
  2. Conversions: degrees, radians, from_base, to_base
  3. Exponents, logarithms, roots: exp, ln, log2, log10, power (pow), cbrt, sqrt
  4. Convenient constants, such as pi(), e(), random (rand)
  5. Cumulative distribution functions (and inverses):binomial_cdf, inverse_binomial_cdf, cauchy_cdf, inverse_cauchy_cdf, chi_squared_cdf, inverse_chi_squared_cdf, normal_cdf, inverse_normal_cdf, poisson_cdf, inverse_poisson_cdf, weibull_cdf, inverse_weibull_cdf, beta_cdf, inverse_beta_cdf, width_bucket
  6. Miscellaneous: mod, cosine_similarity

The mathematical operators are basic arithmetic operators, such as addition (+), subtraction (-), multiplication (*), and modulus (%).

Let’s apply these mathematical functions in an example. In the following query, have a floating-point column x to which we apply several mathematical functions that are representative of the subcategories we discussed previously, including: radians (conversion), natural log, the Normal CDF, modulo, random number, and operators.

select
	x,
	radians(x) as radians_x,							/* convert to radians */
	ln(x) as ln_x,												/* natural log */
	normal_cdf(0,30,x) as_normal_cdf_x,		/* Normal CDF */
	mod(x,2) as mod_x_2,									/* Modulo 2 */
	random() as r,												/* Random number */
	3*((x/2)+2) as formula								/* Formula using operators */
from
	example;

The following is the output the above query with some rounding for ease of viewing.

So, far we see that mathematical functions, as they are classified in Presto, operate on single values. What this means is that given a column of values, each function is applied element-wise to that column. Aggregate functions allow us to look across a set of values.

Like mathematical functions, aggregate functions are also broken into subcategories: 1. general, 2. bitwise, 3. map, 4. approximate, 5. statistical, 6. classification metrics, and 7. differential entropy. We will discuss the general and approximate subcategory separately.

The bitwise aggregate functions are two functions that return the bitwise AND and bitwise OR or all input values in 2’s complement representation. The map aggregate functions provide convenient map creation functions from input values. The statistical aggregate functions are standard summary statistic functions you would expect, such as stddev, variance, kurtosis, and skewness. The classification metrics and differential entropy aggregate functions are specialized functions that make it easy to analyze binary classification predictive modelling and model binary differential entropy, respectively.

The general functions subcategory further fall into another layer of classification:

  1. Common summarizations: count, count_if, min, max, min_by, max_by, sum, avg, geometric_mean, checksum
  2. Boolean tests: bool_or, bool_and, every
  3. Data structure consolidation: array_agg, set_agg, set_union
  4. Miscellaneous: reduce_agg, arbitrary

Again, let’s apply these aggregate functions in a series of representative examples. In the following query, we apply a series of basic aggregations to our floating-point column x .

select
	sum(x) as sum_x,
	count(x) as count_x,
	min(x) as min_x,
	max(x) as max_x,
	avg(x) as avg_x,
	checksum(x) as ckh_x
from
	example;

The following is the output the above query.

In the following query, we showcase a boolean test with the bool_or function. We know that the natural log will return a NaN for negative values of x. So, if we apply the is_nan check, we expect x to always be false, but for our ln result to occasionally be true. Finally, if we were to do the bool_or aggregation on our is_nan functions, we expect the column derived from x to be false (i.e. no true at all) and the column derived fro ln(x) to be true (i.e. at least one true value). The following query and accompanying result illustrate this.

with nan_test as (
	select
		is_nan(x) as is_nan_x,
		is_nan(ln(x)) as is_nan_ln_x
	from
		example
)
select
	bool_or(is_nan_x) as any_nan_x_true,
	bool_or(is_nan_ln_x) as any_nan_ln_x_true
from
	nan_test;

This final example illustrates the use of an example of data consolidation, taking a x and radians(x) columns and creating a single row with a map data structure.

with rad as(select x, radians(x) as rad_x from example)
select map_agg(x, rad_x) from rad;

The approximate aggregate functions provide approximate results for aggregate large data sets, such as distinct values (approx_distinct), percentiles (approx_percentile), and histograms (numeric_histogram). In fact, we have a short answer post on how to use the approx_percentile function. Several of the approximate aggregate functions rely on other functions and data structures: quantile digest, HyperLogLog and KHyperLogLog.

A natural extension to aggregate functions are window functions, which perform calculations across rows of a query results. In fact, all aggregate functions can be used as window functions by adding an OVER clause. One popular application of window functions is time-series analysis. In particular, the lag function window function is quite useful. We have a short answer post on how to use the lag window function and to compute differences in dates using the lag window function.

This was short article was a high-level overview, and you are encouraged to review the Presto public documentation for Mathematical Functions and Operations and Aggregate Functions. If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.