Statistics with Python

PyCon SG 2023 Education Summit

Melvin Zhang

Computational Thinkerers

DYK Python has a statistics module?

from statistics import mean, mode, median, variance, linear_regression

xs = [1, 2, 3, 4, 4]
print("mean", mean(xs))
print("mode", mode(xs))
print("median", median(xs))
print("variance", variance(xs))

ys = [5, 3, 2, 1, 0]
print(linear_regression(xs, ys))

mean 2.8
mode 4
median 3
variance 1.7
LinearRegression(slope=-1.4411764705882355, intercept=6.23529411764706)

Monty Hall problem

Behind one door is a car, others have goats. After you have chosen, the host opened a door with a goat.

Should you switch or keep your original choice?

source: wikipedia.org/wiki/Monty_Hall_problem

Monty Hall problem

import random

def play(switch):
    doors = [1, 2, 3]
    pair = (random.choice(doors), random.choice(doors))
    winning_door, chosen_door = pair
    opened_door = random.choice([d for d in doors if d not in pair])
    if switch:
        chosen_door = sum(doors) - chosen_door - opened_door
    return winning_door == chosen_door

rounds = 1000
wins = {True: 0.0, False: 0.0}
for _ in range(rounds):
    for switch in wins.keys():
        wins[switch] += 1/rounds if play(switch) else 0.0

print(wins)

Result

{True: 0.6600000000000005, False: 0.35700000000000026}

Birthday paradox

On average, how many people do you need to ask to find two with same birthday?

Birthday paradox

from random import choice
from statistics import mean
import matplotlib.pyplot as plt

days = range(365)

def same_birthday_size():
    bdays = set()
    while (chosen := choice(days)) not in bdays:
        bdays.add(chosen)
    return len(bdays) + 1

sizes = [same_birthday_size() for _ in range(100000)]
plt.hist(sizes, bins=range(max(sizes)), density=True)
plt.show()
print("Mean is", mean(sizes))

Birthday paradox

Mean is 24.61776

The Small Schools Myth

ENEM scores from high schools in Brazil.

import pandas as pd
df = pd.read_csv("./data/enem_scores.csv")
df.sort_values(by="avg_score", ascending=False).head(10)

	year	school_id	number_of_students	avg_score
16670	2007	33062633	68	82.97
16796	2007	33065403	172	82.04
16668	2005	33062633	59	81.89
16794	2005	33065403	177	81.66
10043	2007	29342880	43	80.32
18121	2007	33152314	14	79.82
16781	2007	33065250	80	79.67
3026	2007	22025740	144	79.52
14636	2007	31311723	222	79.41
17318	2007	33087679	210	79.38

The Small Schools Myth

Code

import numpy as np
import seaborn as sns
plot_data = (df
.assign(top_school = df["avg_score"] >= np.quantile(df["avg_score"], .99))
[["top_school", "number_of_students"]]
.query(f"number_of_students<{np.quantile(df['number_of_students'], .98)}")) # remove outliers

sns.boxplot(x="top_school", y="number_of_students", data=plot_data)
plt.title("Number of Students of 1% Top Schools (Right)");

The Most Dangerous Equation

Coined by statistician Howard Wainer in 2009¹.

Smaller samples have larger variance in the sample mean.

\sigma_{\bar{x}}^2 = \sigma^2 / n

Use to determine variance of \bar{x} in the Central Limit Theorem.

The Small Schools Myth

Code

q_99 = np.quantile(df["avg_score"], .99)
q_01 = np.quantile(df["avg_score"], .01)
groups = lambda d: np.select([d["avg_score"] > q_99, d["avg_score"] < q_01], ["Top", "Bottom"], "Middle")
plot_data = df.sample(10000).assign(Group = groups)
sns.scatterplot(y="avg_score", x="number_of_students", hue="Group", data=plot_data)
plt.title("Mean ENEM Score by Number of Students in the School");

Conclusion

“Learning by doing, peer-to-peer teaching, and computer simulation are all part of the same equation.”
– Nicholas Negroponte, founder of MIT Media Lab and OLPC

Additional resources

This document melvinzhang.github.io/stats-with-python
Data Science Discovery by UIUC
Python for Statistical Analysis by Udemy
Statistical Simulation in Python by DataCamp

Extra topics

Central Limit Theorem
Friendship paradox
Simpson’s paradox
Simulating a loaded dice efficiently
Bessel’s correction for sample variance

Bessel’s correction

Population mean \mu = \frac{1}{N} \sum_{i=1}^{N} x_i

Sample mean \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

Bessel’s correction

Population variance \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)

Sample variance s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})

Bessel’s correction

from statistics import mean, variance, pvariance
from random import gauss

pvars = []
svars = []
for _ in range(1000):
    data = [gauss(mu=0.0, sigma=1.0) for _ in range(20)]
    pvars.append(pvariance(data))
    svars.append(variance(data))

print("Actual variance =", 1.0)
print("Mean pvariance  =", mean(pvars))
print("Mean variance   =", mean(svars))

Result

Actual variance = 1.0
Mean pvariance  = 0.971613077249861
Mean variance   = 1.0227506076314326