Statistics with Python

PyCon SG 2023 Education Summit

Melvin Zhang

Computational Thinkerers

Did you know Python has a statistics module?

from statistics import mean, mode, median, variance, linear_regression

xs = [1, 2, 3, 4, 4]
print("mean", mean(xs))
print("mode", mode(xs))
print("median", median(xs))
print("variance", variance(xs))

ys = [5, 3, 2, 1, 0]
print(linear_regression(xs, ys))
Result

mean 2.8
mode 4
median 3
variance 1.7
LinearRegression(slope=-1.4411764705882355, intercept=6.23529411764706)

Monty Hall problem

Behind one door is a car; the others hide goats. After you have chosen a door, the host opens another door to reveal a goat.

Should you switch or keep your original choice?

source: wikipedia.org/wiki/Monty_Hall_problem

Monty Hall problem

import random

def play(switch):
    doors = [1, 2, 3]
    # the winning door and the player's pick are chosen independently
    pair = (random.choice(doors), random.choice(doors))
    winning_door, chosen_door = pair
    # the host opens one of the remaining doors, which must hide a goat
    opened_door = random.choice([d for d in doors if d not in pair])
    if switch:
        # the doors sum to 6, so this picks the one unopened, unchosen door
        chosen_door = sum(doors) - chosen_door - opened_door
    return winning_door == chosen_door

rounds = 1000
wins = {True: 0.0, False: 0.0}
for _ in range(rounds):
    for switch in wins.keys():
        wins[switch] += 1/rounds if play(switch) else 0.0

print(wins)

Result

{True: 0.6600000000000005, False: 0.35700000000000026}
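The simulated frequencies hover around the exact answers of 2/3 (switch) and 1/3 (stay). Since switching wins exactly when the first pick was a goat, the exact probabilities can be checked by enumerating the nine equally likely (winning, chosen) pairs; a sketch, where exact_win_probability is an illustrative helper:

```python
from fractions import Fraction
from itertools import product

doors = [1, 2, 3]

def exact_win_probability(switch):
    # every (winning, chosen) pair is equally likely; switching wins
    # exactly when the first pick was a goat, staying when it was the car
    favourable = sum((winning != chosen) == switch
                     for winning, chosen in product(doors, doors))
    return Fraction(favourable, len(doors) ** 2)

print(exact_win_probability(True))   # 2/3
print(exact_win_probability(False))  # 1/3
```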

Birthday paradox

On average, how many people do you need to ask to find two with the same birthday?

Birthday paradox

from random import choice
from statistics import mean
import matplotlib.pyplot as plt

days = range(365)

def same_birthday_size():
    # draw random birthdays until one repeats; return how many people
    # were asked, including the one whose birthday repeated
    bdays = set()
    while (chosen := choice(days)) not in bdays:
        bdays.add(chosen)
    return len(bdays) + 1

sizes = [same_birthday_size() for _ in range(100000)]
plt.hist(sizes, bins=range(max(sizes)), density=True)
plt.show()
print("Mean is", mean(sizes))

Birthday paradox

Mean is 24.61776
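The simulated mean agrees with the exact expectation, which follows from E[N] = Σ_{k≥0} P(N > k), where N > k exactly when the first k birthdays are all distinct. A sketch (expected_people_until_repeat is an illustrative name):

```python
def expected_people_until_repeat(days=365):
    # E[N] = sum over k >= 0 of P(N > k), and N > k exactly when the
    # first k people all have distinct birthdays
    expectation = 0.0
    p_all_distinct = 1.0  # P(first 0 people all distinct) = 1
    for k in range(days + 1):
        expectation += p_all_distinct
        p_all_distinct *= (days - k) / days
    return expectation

print(expected_people_until_repeat())  # about 24.617
```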

The Small Schools Myth

ENEM scores from high schools in Brazil.

import pandas as pd
df = pd.read_csv("./data/enem_scores.csv")
df.sort_values(by="avg_score", ascending=False).head(10)
       year  school_id  number_of_students  avg_score
16670  2007   33062633                  68      82.97
16796  2007   33065403                 172      82.04
16668  2005   33062633                  59      81.89
16794  2005   33065403                 177      81.66
10043  2007   29342880                  43      80.32
18121  2007   33152314                  14      79.82
16781  2007   33065250                  80      79.67
3026   2007   22025740                 144      79.52
14636  2007   31311723                 222      79.41
17318  2007   33087679                 210      79.38

The Small Schools Myth

Code
import numpy as np
import seaborn as sns
plot_data = (df
.assign(top_school = df["avg_score"] >= np.quantile(df["avg_score"], .99))
[["top_school", "number_of_students"]]
.query(f"number_of_students<{np.quantile(df['number_of_students'], .98)}")) # remove outliers

sns.boxplot(x="top_school", y="number_of_students", data=plot_data)
plt.title("Number of Students of 1% Top Schools (Right)");

The Most Dangerous Equation

Coined by statistician Howard Wainer in 2009.

Smaller samples have larger variance in the sample mean.

\sigma_{\bar{x}}^2 = \sigma^2 / n

Used to determine the variance of the sample mean \bar{x} in the Central Limit Theorem.
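The equation can be illustrated directly: draw many samples of each size n from the same population and compare the empirical variance of their means against \sigma^2 / n. A minimal sketch, assuming a standard normal population (\sigma = 1), where variance_of_sample_mean is an illustrative helper:

```python
from random import gauss
from statistics import mean, variance

def variance_of_sample_mean(n, trials=2000):
    # empirical variance of the mean of n draws from a population
    # with sigma = 1, which the equation predicts to be 1 / n
    return variance([mean(gauss(0.0, 1.0) for _ in range(n))
                     for _ in range(trials)])

for n in (10, 40, 160):
    print(n, variance_of_sample_mean(n))
```

With \sigma = 1 the printed variances should come out close to 0.1, 0.025, and 0.00625, shrinking by a factor of four each time n quadruples.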

The Small Schools Myth

Code
q_99 = np.quantile(df["avg_score"], .99)
q_01 = np.quantile(df["avg_score"], .01)
groups = lambda d: np.select([d["avg_score"] > q_99, d["avg_score"] < q_01], ["Top", "Bottom"], "Middle")
plot_data = df.sample(10000).assign(Group = groups)
sns.scatterplot(y="avg_score", x="number_of_students", hue="Group", data=plot_data)
plt.title("Mean ENEM Score by Number of Students in the School");

Conclusion

“Learning by doing, peer-to-peer teaching, and computer simulation are all part of the same equation.”
– Nicholas Negroponte, founder of MIT Media Lab and OLPC


Additional resources

Extra topics

  • Central Limit Theorem
  • Friendship paradox
  • Simpson’s paradox
  • Simulating a loaded dice efficiently
  • Bessel’s correction for sample variance

Bessel’s correction

Population mean \mu = \frac{1}{N} \sum_{i=1}^{N} x_i

Sample mean \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

Bessel’s correction

Population variance \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

Sample variance s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Bessel’s correction

from statistics import mean, variance, pvariance
from random import gauss

pvars = []
svars = []
for _ in range(1000):
    data = [gauss(mu=0.0, sigma=1.0) for _ in range(20)]
    pvars.append(pvariance(data))
    svars.append(variance(data))

print("Actual variance =", 1.0)
print("Mean pvariance  =", mean(pvars))
print("Mean variance   =", mean(svars))

Result

Actual variance = 1.0
Mean pvariance  = 0.971613077249861
Mean variance   = 1.0227506076314326