How do I use the approx_percentile function in Presto?


The Presto approx_percentile is one of the approximate aggregate functions, and it returns an approximate percentile for a set of values (e.g. column). In this short article, we will explain how to use the approx_percentile function.

What is a percentile?

percentile function

From Wikipedia:

In statistics, a percentile (or a centile) is a score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition)

To apply this, we’ll walk through an example with data points from a known, and arguably most famous, distribution—-the Normal (or Gaussian) distribution. The adjacent diagram plots the density of a Normal distribution with a mean of 100 and standard deviation of 10.

If we were to sample data points from this Normal distribution, we know that approximately half of the data points would be less that the mean and half of the data points would be above the the mean. Hence, the mean, or 100 in this case, would be the 50th percentile for the data. It turns out that the 90th percentile would approximately be 112.82; this means that 90% of the data points are less than 112.82.

approx_percentile by example

To solidify our understanding of this and the approx_percentile function, we’ve created a few tables to use as example:

presto:default> show tables;
(4 rows)
TableDescriptionNumber of Rows
dummySingle column 100 row table of all ones except for a single value of 100.100
norm_0_1Samples from normal distribution with mean of 0 and standard deviation of 1.5000
norm_100_10Samples from normal distribution with mean of 100 and standard deviation of 10.5000
norm_allCoalescence of all normal distribution tables.10000
Table 1

The approx_percentile function has eight type signatures. You are encouraged to review the Presto public documentation for all the function variants and official descriptions. The set of values (e.g. column) is a required parameter and is always the first argument.

Another required parameter is the percentage parameter, which indicates the percentage or percentages for the returned approximate percentile. The percentage(s) must be specified as a number between zero and one. The percentage parameter can either be the second or third argument of the function, depending on the intended signature. In the following examples, the percentage parameter will be the second argument. For example, approx_percentile(x,0.5) will return the approximate percentile for column x at 50%. For data points in our norm_100_10 table, we expect the returned value to be around 100.

presto:default> select approx_percentile(x,0.5) from norm_100_10;
(1 row)

approx_percentile(x,0.9) will return the approximate percentile for column x at 90%, which for the data in norm_100_10 table should be around 112.82.

presto:default> select approx_percentile(x,0.9) from norm_100_10;
(1 row)

In a single query, you can also specify an array of percentages to compute percentiles: approx_percentile(x,ARRAY[0.5, 0.9]).

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_100_10;
 [99.8184647799587, 112.692881777202]
(1 row)

We can ask for multiple percentages for our dummy table, which consists of a 100 rows of all ones except for a single value of 100. Hence, we expect all percentiles below 99% to 1.

presto:default> select approx_percentile(x,ARRAY[0.1, 0.5, 0.98, 0.99]) from dummy;
 [1.0, 1.0, 1.0, 100.0]
(1 row)

We can also use a GROUP BY clause to segment the values to compute percentiles over. To illustrate this, we will use our norm_all table, which contains values from both the norm_100_10 and the norm_0_1 tables. The m and sd columns specify the mean and standard deviation of the normal distribution the corresponding x value is sampled from.

presto:default> select m, sd, x from norm_all order by rand() limit 10;
  m  | sd |         x
 0   | 1  | -0.540796486700647
 0   | 1  |   0.81148151337731
 0   | 1  |   1.28976310661005
 100 | 10 |   97.0272872801269
 100 | 10 |   83.1392343835652
 0   | 1  | -0.585678877703149
 0   | 1  |  0.268589447255106
 0   | 1  | -0.280908719376113
 100 | 10 |    104.36328077332
 0   | 1  |  0.266294347905949
(10 rows)

The following query then will return approximate score for 50% and 90% for data points grouped by the same values of m and sd (i.e. from the same normal distribution): select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd)). As expected, we see that the approximate 50% and 90% percentile is around 100 and 112.82 for a mean of 100 and standard deviation of 10 and around 0 and 1.28 for a mean of 0 and standard deviation of 1.

presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd));
 [99.8563616481321, 112.879972343696]
 [-0.00458419083839064, 1.30949677294588]
(2 rows)

An optional parameter is accuracy, which controls the maximum rank error and defaults to 0.01. The value of accuracy must be between zero and one (exclusive) and must be constant for all input rows. We can add accuracy as the third argument to our function.

For example, approx_percentile(x,0.9,0.5) will return the approximate score for column x at 90% with 0.5 maximum rank error accuracy. By allowing for larger error (from the 0.01 default), we can see the approximate score of 113.50 is farther away from our true value of 112.82 than our previous result of 112.88.

presto:default> select approx_percentile(x,0.9,0.5) from norm_100_10;
(1 row)

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.