When we measure something, we always have to estimate the uncertainty of the result. Confidence intervals are a very useful tool for calculating a range in which we can find the real value of the observable with a certain confidence.
What is a confidence interval?
Imagine you ask me my height. I could say that I’m 1.93 m tall, but I’m not giving you any information about the uncertainty of this measurement. Confidence intervals are intervals in which we have a certain confidence of finding the real value of the observable we measure. Scientists usually report the 95% confidence interval, but 90% and 99% are very common as well. So, when you ask me about my height, I should answer with an error estimate or with a confidence interval, like “with 95% confidence, my height is between 1.92 m and 1.93 m”.
That’s what this tool gives us: an interval of where to find the real value of the observable.
Some useful properties of confidence intervals are:
- At fixed confidence, the interval becomes narrower and narrower as the sample size increases. This is due to the law of large numbers.
- At a fixed sample size, the interval becomes wider and wider as the confidence increases. So, to have a greater confidence, we must accept a larger interval.
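We can see the first property in action with a quick sketch (the function name ci_width is mine): it computes the width of the t-based interval, defined in the next section, for growing samples drawn from a standard normal distribution.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)

def ci_width(n, confidence=0.95):
    # Width of the t-based confidence interval for a sample of size n
    sample = rng.normal(size=n)
    s = sample.std(ddof=1)  # sample standard deviation
    t_crit = np.abs(t.ppf((1 - confidence) / 2, n - 1))
    return 2 * t_crit * s / np.sqrt(n)

for n in (10, 100, 1000):
    print(n, ci_width(n))
```

Because of the square root of the sample size in the denominator, the width shrinks roughly by a factor of √10 each time the sample grows tenfold.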
In data science and statistics, confidence intervals are very useful for reporting our measurement scientifically, so that other scientists can compare their results with ours.
Confidence interval formula
In this article, I’ll cover the calculation of the confidence interval on the mean value of a sample, which is an estimate of the population expected value.
Given the sample mean m, the sample standard deviation s and the sample size N, the confidence interval is defined by the following formula:
\left( m - t \frac{s}{\sqrt N},m + t \frac{s}{\sqrt N} \right)
As you can see, there’s a t parameter, which is related to the confidence we want. This parameter can be calculated in different ways. If our sample size is small (i.e. fewer than 30 points), we can use Student’s t distribution to calculate it. Given the confidence, we have to select the value of t such that the area of the distribution in the [-t, t] interval is equal to our confidence.
Mathematically speaking, given a confidence value equal to c, the corresponding value of t is:
t_c = \left| I_{N-1}\left( \frac{1-c}{2}\right) \right |
where I_{N-1}(x) is Student’s inverse cumulative distribution function with N-1 degrees of freedom. Practically speaking, t is the value at which the right tail of the distribution equals half of the area that remains once we subtract the confidence from 1. This way, the area included between the tails is equal to the confidence we want.
If the sample size is large (i.e. larger than 30 points), we can approximate Student’s t distribution with a normal distribution and forget about the degrees of freedom.
These distributions are a consequence of the central limit theorem. For large samples, the mean value of a sample behaves like a gaussian variable (as long as the measures are independent and the population has a finite variance). For small samples, Student’s t distribution accounts for the extra uncertainty introduced by estimating the standard deviation from the sample itself.
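As a quick illustration of the central limit theorem, this sketch of mine draws many samples from a clearly non-gaussian population (an exponential distribution) and checks that their means cluster around the population mean like a gaussian variable would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 5000 samples of 100 points each from an exponential population
# (mean 1.0, standard deviation 1.0) and compute each sample's mean.
means = np.array([rng.exponential(scale=1.0, size=100).mean()
                  for _ in range(5000)])

# CLT: the means concentrate around the population mean (1.0),
# with a spread close to sigma / sqrt(N) = 1 / 10.
print(means.mean(), means.std())
```

Even though a single exponential draw is strongly skewed, the distribution of the sample means is already close to a gaussian at N = 100.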
Using the bootstrap
Those who follow my articles know that I really love the bootstrap technique. That’s because it’s an algorithm that doesn’t make any assumption about the distribution of our dataset.
We can use bootstrap to calculate confidence intervals as well using this simple procedure:
- Create a new sample based on our dataset, with replacement and with the same number of points
- Calculate the mean value and store it in an array or list
- Repeat the process many times (e.g. 1000)
- On the list of the mean values, calculate 2.5th percentile and 97.5th percentile (if you want a 95% confidence interval)
The bootstrap gives us a distribution-free estimate, at the cost of a computationally heavier algorithm. I prefer using it when coding such an algorithm is not a problem, but you can generally use the original formula safely in almost every situation.
Confidence interval calculator in Python
Let’s now calculate the confidence intervals in Python using Student’s t distribution and the bootstrap technique.
Let’s import some useful libraries.
import numpy as np
from scipy.stats import t
Let’s now simulate a dataset made of 100 numbers extracted from a normal distribution.
x = np.random.normal(size=100)
Say we want to calculate the 95% confidence interval of the mean value. Let’s calculate all the numbers we need according to the confidence interval formula.
m = x.mean()
s = x.std(ddof=1) # ddof=1 gives the sample standard deviation
dof = len(x)-1
confidence = 0.95
We now need the value of t. The function that calculates the inverse cumulative distribution is ppf. We have to take the absolute value because ppf works with the left tail, so its result would be negative.
t_crit = np.abs(t.ppf((1-confidence)/2,dof))
Now, we can apply the original formula to calculate the 95% confidence interval.
(m-s*t_crit/np.sqrt(len(x)), m+s*t_crit/np.sqrt(len(x)))
# (-0.14017768797464097, 0.259793719043611)
We know this result is reasonable because our data come from a normal distribution with 0 mean, and 0 lies inside the interval. If we didn’t know anything about the population, we could say that, with 95% confidence, its expected value lies between -0.14 and 0.26.
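For completeness, SciPy can compute the same interval in one line with t.interval, using scipy.stats.sem for the standard error (in SciPy versions before 1.9 the first argument is named alpha rather than confidence, so I pass it positionally here).

```python
import numpy as np
from scipy.stats import t, sem

x = np.random.normal(size=100)
m = x.mean()

# sem computes s / sqrt(N) with ddof=1 by default
low, high = t.interval(0.95, len(x) - 1, loc=m, scale=sem(x))
print(low, high)
```

The interval is symmetric around the sample mean, exactly as the closed formula prescribes.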
We could have reached the same result using the bootstrap, which makes no assumptions about the distribution. In this example, I create 1000 resamples of our dataset (with replacement).
values = [np.random.choice(x,size=len(x),replace=True).mean() for i in range(1000)]
np.percentile(values,[100*(1-confidence)/2,100*(1-(1-confidence)/2)])
# array([-0.13559955, 0.26480175])
As we can see, the result is almost equal to the one we have reached with the closed formula.
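Putting everything together, here’s a small helper (the name and signature are my own) that computes the confidence interval of the mean with either method.

```python
import numpy as np
from scipy.stats import t

def mean_confidence_interval(x, confidence=0.95, bootstrap=False,
                             n_boot=1000, seed=None):
    """Confidence interval for the mean, via Student's t or the bootstrap."""
    x = np.asarray(x)
    if bootstrap:
        rng = np.random.default_rng(seed)
        # Resample with replacement and collect the means
        means = [rng.choice(x, size=len(x), replace=True).mean()
                 for _ in range(n_boot)]
        lo, hi = np.percentile(means, [100 * (1 - confidence) / 2,
                                       100 * (1 + confidence) / 2])
        return lo, hi
    # Closed formula based on Student's t distribution
    m, s, n = x.mean(), x.std(ddof=1), len(x)
    t_crit = np.abs(t.ppf((1 - confidence) / 2, n - 1))
    return m - t_crit * s / np.sqrt(n), m + t_crit * s / np.sqrt(n)

x = np.random.normal(size=100)
print(mean_confidence_interval(x))                  # Student's t
print(mean_confidence_interval(x, bootstrap=True))  # bootstrap
```

Both branches implement exactly the procedures shown above, so the two intervals should agree closely on any reasonably sized sample.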
Conclusions
Confidence intervals are easy to calculate and give very useful insight to data analysts and scientists. They provide a powerful error estimate and, if used correctly, can really help us extract as much information as possible from our data.