Why np.quantile in NumPy can return quantile values that are not in the original array

What is a quantile

A quantile is a statistical concept used to describe location in a data set. In an ordered data set, the $q$ quantile is a value such that at least $q \times 100\%$ of the data points are less than or equal to this value, and at least $(1 - q) \times 100\%$ of the data points are greater than or equal to this value.
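
To make the definition concrete, here is a minimal sketch that checks both conditions for a candidate value. The helper name `satisfies_quantile_definition` and the sample data are hypothetical, chosen only for illustration. It shows that a value lying between two data points can satisfy the definition just as well as a value taken from the data.

```python
import numpy as np

def satisfies_quantile_definition(data, value, q):
    """Check whether `value` meets the two-sided definition of a q quantile."""
    data = np.asarray(data, dtype=float)
    share_at_or_below = np.mean(data <= value)  # fraction of points <= value
    share_at_or_above = np.mean(data >= value)  # fraction of points >= value
    return bool(share_at_or_below >= q and share_at_or_above >= 1 - q)

sample = [1.0, 2.0, 3.0, 4.0]  # illustrative data, not from the article

in_data = satisfies_quantile_definition(sample, 2.0, 0.5)    # value present in the data
between = satisfies_quantile_definition(sample, 2.5, 0.5)    # value absent from the data
too_small = satisfies_quantile_definition(sample, 1.0, 0.5)  # fails the first condition

print(in_data, between, too_small)
```

Both 2.0 and 2.5 qualify as a 0.5 quantile of this sample, so the definition alone does not force the answer to be one of the data points.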

When you use the np.quantile function in NumPy and set axis=0, the function calculates quantiles column-wise. Specifically, it finds the $q$ quantile of each column.
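
As a small sketch of what the axis argument does, the following compares axis=0 with axis=1 on a hypothetical 3x4 array (the array `demo` is an assumption chosen so the results are easy to check by hand).

```python
import numpy as np

# Hypothetical 3x4 array: rows are [0..3], [4..7], [8..11]
demo = np.arange(12, dtype=float).reshape(3, 4)

per_column = np.quantile(demo, 0.25, axis=0)  # one quantile per column (4 values)
per_row = np.quantile(demo, 0.25, axis=1)     # one quantile per row (3 values)

print(per_column.shape, per_row.shape)
print(per_column)
print(per_row)
```

With axis=0 the result has one entry per column, so a 3x4 input yields 4 values; with axis=1 it yields 3.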

Regarding your question “Why not in the original array”: a quantile is calculated from the distribution of the data and is not necessarily a value that actually appears in the data set. A quantile may coincide with a single value in the data set, or it may be an interpolated value (for example, the average) between two adjacent sorted values.

Let’s illustrate with an example. Suppose we have the following array:

arr =
[[0.77395605, 0.43887844, 0.85859792, 0.69736803],
[0.09417735, 0.97562235, 0.7611397, 0.78606431],
[0.12811363, 0.45038594, 0.37079802, 0.92676499]]

We can use NumPy’s np.quantile function to find the 1/4 quantile.

The calculated 1/4 quantile (by column) is:

[0.11114549, 0.44463219, 0.56596886, 0.74171617]

This means the following (with only three values per column, the proportions are approximate):

  • In the first column, about 25% of the data is less than or equal to 0.11114549, and about 75% of the data is greater than or equal to 0.11114549.

  • In the second column, about 25% of the data is less than or equal to 0.44463219, and about 75% of the data is greater than or equal to 0.44463219.

  • In the third column, about 25% of the data is less than or equal to 0.56596886, and about 75% of the data is greater than or equal to 0.56596886.

  • In the fourth column, about 25% of the data is less than or equal to 0.74171617, and about 75% of the data is greater than or equal to 0.74171617.

Note that none of these quantile values appears in the original array. Each one is an interpolation between two adjacent sorted data points in its column, which is why a quantile need not be a value from the original array.

Code:

import numpy as np

# Given array
arr = np.array([
    [0.77395605, 0.43887844, 0.85859792, 0.69736803],
    [0.09417735, 0.97562235, 0.7611397, 0.78606431],
    [0.12811363, 0.45038594, 0.37079802, 0.92676499]
])

# Calculate the 1/4 quantile for each column
quantile_25 = np.quantile(arr, q=0.25, axis=0)
quantile_25
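
If you need quantiles that are always actual data values, one option is the method argument of np.quantile. The sketch below assumes NumPy 1.22 or later, where the keyword is called method (older versions used a keyword named interpolation instead). The "lower" and "higher" methods pick the sorted value just below or just above the interpolated position rather than interpolating.

```python
import numpy as np

arr = np.array([
    [0.77395605, 0.43887844, 0.85859792, 0.69736803],
    [0.09417735, 0.97562235, 0.7611397, 0.78606431],
    [0.12811363, 0.45038594, 0.37079802, 0.92676499]
])

# Assumes NumPy >= 1.22 for the `method` keyword
lower = np.quantile(arr, 0.25, axis=0, method="lower")    # sorted value below the position
higher = np.quantile(arr, 0.25, axis=0, method="higher")  # sorted value above the position

print(lower)
print(higher)
```

Here the default (linear) result for each column lies halfway between the corresponding `lower` and `higher` values.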

How to find the quantile

There are many ways to calculate quantiles, but a common method follows these steps:

  1. Sort the data: First, sort the data in ascending order, since the position formula in the next step counts from the smallest value.

  2. Calculate the position parameter: Use the following formula to calculate the position parameter (index) of the $q$ quantile:

$$i = q \times (N - 1) + 1$$

where $N$ is the number of data points and $q$ is the quantile sought (for example, for the 1/4 quantile, $q = 0.25$).

  3. Find the quantile:
  • Integer index: if $i$ is an integer, then the data value at index $i$ is the $q$ quantile.

  • Non-integer index: if $i$ is not an integer, then linear interpolation is usually used to find the quantile. This is usually done with the following formula:

$$\text{Quantile} = (1 - \alpha) \times \text{Value}_{\lfloor i \rfloor} + \alpha \times \text{Value}_{\lceil i \rceil}$$

where $\lfloor i \rfloor$ and $\lceil i \rceil$ are $i$ rounded down and up, respectively, and $\alpha = i - \lfloor i \rfloor$.
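
The steps above can also be sketched compactly with np.interp, which performs the same linear interpolation between neighbouring sorted values. The helper name `quantile_by_interp` and the sample values are hypothetical; the sketch uses 0-based positions, so the position is $q \times (N - 1)$ rather than $q \times (N - 1) + 1$.

```python
import numpy as np

def quantile_by_interp(data, q):
    """Linear-interpolation quantile using 0-based positions (illustrative helper)."""
    sorted_data = np.sort(np.asarray(data, dtype=float))  # step 1: sort ascending
    n = sorted_data.size
    position = q * (n - 1)  # step 2: 0-based equivalent of i = q * (N - 1) + 1
    # step 3: interpolate linearly between the two neighbouring sorted values
    return float(np.interp(position, np.arange(n), sorted_data))

values = [10.0, 20.0, 40.0]  # illustrative data, not from the article
result = quantile_by_interp(values, 0.25)
print(result)
```

For these three values the position is 0.5, so the result lies halfway between 10.0 and 20.0.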

Let’s explain this process through a concrete example. We can use the first column of the NumPy array ([0.77395605, 0.09417735, 0.12811363]) to calculate the 1/4 quantile.

After sorting the data in the first column ([0.77395605, 0.09417735, 0.12811363]), we get:

Sorted data = [0.09417735, 0.12811363, 0.77395605]

Calculate the position parameter $i$ (1-based index) for the 1/4 quantile:

$$i = 0.25 \times (3 - 1) + 1 = 1.5$$

Because $i$ is not an integer ($i = 1.5$), we need to use linear interpolation to find the quantile. We use the following formula:

$$\text{Quantile} = (1 - \alpha) \times \text{Value}_{\lfloor i \rfloor} + \alpha \times \text{Value}_{\lceil i \rceil}$$

In this example, $\text{Value}_{\lfloor i \rfloor} = 0.09417735$, $\text{Value}_{\lceil i \rceil} = 0.12811363$, and $\alpha = 1.5 - 1 = 0.5$.

So,

$$\text{Quantile} = (1 - 0.5) \times 0.09417735 + 0.5 \times 0.12811363 = 0.11114549$$

This is exactly what we get with NumPy. This way, you can better understand how the quantile is calculated and why it might not be in the original data set.

Code:

# Extract the first column from the array
first_column = arr[:, 0]

# Sort the data
sorted_data = np.sort(first_column)

# Number of data points
N = len(sorted_data)

# Quantile to calculate (1/4)
q = 0.25

# Calculate index parameter (1-based index)
i = q * (N - 1) + 1

# Calculate the quantile
if i.is_integer():
    quantile_value = sorted_data[int(i) - 1]
else:
    lower_value = sorted_data[int(np.floor(i)) - 1]
    upper_value = sorted_data[int(np.ceil(i)) - 1]
    alpha = i - np.floor(i)
    quantile_value = (1 - alpha) * lower_value + alpha * upper_value

sorted_data, i, quantile_value
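
As a further check, the same manual procedure can be applied to every column and compared against np.quantile. This is a minimal sketch, not NumPy's internal implementation; the function name `manual_quantile` is hypothetical.

```python
import numpy as np

def manual_quantile(column, q):
    """Quantile via the 1-based position formula and linear interpolation."""
    sorted_data = np.sort(column)
    i = q * (len(sorted_data) - 1) + 1  # 1-based position
    lower = int(np.floor(i))
    upper = int(np.ceil(i))
    alpha = i - lower                   # fractional part of the position
    # Subtract 1 to convert 1-based positions to Python's 0-based indexing
    return (1 - alpha) * sorted_data[lower - 1] + alpha * sorted_data[upper - 1]

arr = np.array([
    [0.77395605, 0.43887844, 0.85859792, 0.69736803],
    [0.09417735, 0.97562235, 0.7611397, 0.78606431],
    [0.12811363, 0.45038594, 0.37079802, 0.92676499]
])

manual = np.array([manual_quantile(arr[:, j], 0.25) for j in range(arr.shape[1])])
matches = np.allclose(manual, np.quantile(arr, 0.25, axis=0))
print(manual, matches)
```

When the position is an integer, `alpha` is 0 and the formula reduces to picking the data value at that position, so no separate branch is needed in this sketch.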