The Presto `approx_percentile`

is one of the approximate aggregate functions, and it returns an approximate percentile for a set of values (e.g. column). In this short article, we will explain how to use the `approx_percentile`

function.

## What is a percentile?

From Wikipedia:

In statistics, a percentile (or a centile) is a score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition)

To apply this, we’ll walk through an example with data points from a known, and arguably most famous, distribution—-the Normal (or Gaussian) distribution. The adjacent diagram plots the density of a Normal distribution with a mean of 100 and standard deviation of 10.

If we were to sample data points from this Normal distribution, we know that approximately half of the data points would be less that the mean and half of the data points would be above the the mean. Hence, the mean, or 100 in this case, would be the 50th percentile for the data. It turns out that the 90th percentile would approximately be 112.82; this means that 90% of the data points are less than 112.82.

## approx_percentile by example

To solidify our understanding of this and the `approx_percentile`

function, we’ve created a few tables to use as example:

```
presto:default> show tables;
Table
-------------
dummy
norm_0_1
norm_100_10
norm_all
(4 rows)
```

Table | Description | Number of Rows |
---|---|---|

`dummy` | Single column 100 row table of all ones except for a single value of 100. | 100 |

`norm_0_1` | Samples from normal distribution with mean of 0 and standard deviation of 1. | 5000 |

`norm_100_10` | Samples from normal distribution with mean of 100 and standard deviation of 10. | 5000 |

`norm_all` | Coalescence of all normal distribution tables. | 10000 |

The `approx_percentile`

function has eight type signatures. You are encouraged to review the Presto public documentation for all the function variants and official descriptions. The set of values (e.g. column) is a required parameter and is always the first argument.

Another required parameter is the `percentage`

parameter, which indicates the percentage or percentages for the returned approximate percentile. The percentage(s) must be specified as a number between zero and one. The `percentage`

parameter can either be the second or third argument of the function, depending on the intended signature. In the following examples, the `percentage`

parameter will be the second argument. For example, `approx_percentile(x,0.5)`

will return the approximate percentile for column `x`

at 50%. For data points in our `norm_100_10`

table, we expect the returned value to be around 100.

```
presto:default> select approx_percentile(x,0.5) from norm_100_10;
_col0
------------------
99.8184647799587
(1 row)
```

`approx_percentile(x,0.9)`

will return the approximate percentile for column x at 90%, which for the data in norm_100_10 table should be around 112.82.

```
presto:default> select approx_percentile(x,0.9) from norm_100_10;
_col0
------------------
112.692881777202
(1 row)
```

In a single query, you can also specify an array of percentages to compute percentiles: `approx_percentile(x,ARRAY[0.5, 0.9])`

.

```
presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_100_10;
_col0
--------------------------------------
[99.8184647799587, 112.692881777202]
(1 row)
```

We can ask for multiple percentages for our `dummy`

table, which consists of a 100 rows of all ones except for a single value of 100. Hence, we expect all percentiles below 99% to 1.

```
presto:default> select approx_percentile(x,ARRAY[0.1, 0.5, 0.98, 0.99]) from dummy;
_col0
------------------------
[1.0, 1.0, 1.0, 100.0]
(1 row)
```

We can also use a `GROUP BY`

clause to segment the values to compute percentiles over. To illustrate this, we will use our `norm_all`

table, which contains values from both the `norm_100_10`

and the `norm_0_1`

tables. The `m`

and `sd`

columns specify the mean and standard deviation of the normal distribution the corresponding `x`

value is sampled from.

```
presto:default> select m, sd, x from norm_all order by rand() limit 10;
m | sd | x
-----+----+--------------------
0 | 1 | -0.540796486700647
0 | 1 | 0.81148151337731
0 | 1 | 1.28976310661005
100 | 10 | 97.0272872801269
100 | 10 | 83.1392343835652
0 | 1 | -0.585678877703149
0 | 1 | 0.268589447255106
0 | 1 | -0.280908719376113
100 | 10 | 104.36328077332
0 | 1 | 0.266294347905949
(10 rows)
```

The following query then will return approximate score for 50% and 90% for data points grouped by the same values of `m`

and `sd`

(i.e. from the same normal distribution): `select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd))`

. As expected, we see that the approximate 50% and 90% percentile is around 100 and 112.82 for a mean of 100 and standard deviation of 10 and around 0 and 1.28 for a mean of 0 and standard deviation of 1.

```
presto:default> select approx_percentile(x,ARRAY[0.5, 0.9]) from norm_all group by grouping sets ((m,sd));
_col0
------------------------------------------
[99.8563616481321, 112.879972343696]
[-0.00458419083839064, 1.30949677294588]
(2 rows)
```

An optional parameter is `accuracy`

, which controls the maximum rank error and defaults to 0.01. The value of accuracy must be between zero and one (exclusive) and must be constant for all input rows. We can add `accuracy`

as the third argument to our function.

For example, `approx_percentile(x,0.9,0.5)`

will return the approximate score for column `x`

at 90% with 0.5 maximum rank error accuracy. By allowing for larger error (from the 0.01 default), we can see the approximate score of 113.50 is farther away from our true value of 112.82 than our previous result of 112.88.

```
presto:default> select approx_percentile(x,0.9,0.5) from norm_100_10;
_col0
--------------------
113.49999999999999
(1 row)
```

If you want to get started with Presto easily, check out Ahana Cloud. It’s SaaS for Presto and takes away all the complexities of tuning, management and more. It’s free to try out for 14 days, then it’s pay-as-you-go through the AWS marketplace.