On average, how many people do you need to ask to find two with the same birthday?
Birthday paradox
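from random import choice
from statistics import mean

import matplotlib.pyplot as plt

days = range(365)  # possible birthdays, ignoring leap years

def same_birthday_size():
    """Ask people one at a time until a birthday repeats; return how many were asked."""
    bdays = set()
    while (chosen := choice(days)) not in bdays:
        bdays.add(chosen)
    return len(bdays) + 1  # distinct birthdays seen, plus the person who matched

sizes = [same_birthday_size() for _ in range(100000)]
# One bin per group size, including the largest size observed
plt.hist(sizes, bins=range(max(sizes) + 2), density=True)
plt.show()
print("Mean is", mean(sizes))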
Mean is 24.61776
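The simulated mean can be cross-checked without random sampling. The sketch below assumes 365 equally likely birthdays and uses the identity E[N] = sum over k >= 0 of P(N > k), where P(N > k) is the probability that the first k people all have different birthdays. The function name `expected_people_until_match` is introduced here for illustration only.

```python
def expected_people_until_match(n_days: int = 365) -> float:
    """Exact expected number of people asked until a birthday repeats.

    Assumes every one of the n_days birthdays is equally likely.
    """
    expected = 0.0
    p_all_distinct = 1.0  # P(first k people all differ), starting at k = 0
    for k in range(n_days + 1):
        expected += p_all_distinct             # add P(N > k)
        p_all_distinct *= (n_days - k) / n_days  # person k + 1 must avoid k taken days
    return expected


# Roughly 24.6, close to the simulated mean
print("Exact mean is", expected_people_until_match())
```

The result should agree with the simulated value of 24.61776 to within sampling noise.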
The Small Schools Myth
ENEM scores from high schools in Brazil.
import pandas as pd

df = pd.read_csv("./data/enem_scores.csv")
# Ten rows with the highest average score
df.sort_values(by="avg_score", ascending=False).head(10)
       year  school_id  number_of_students  avg_score
16670  2007   33062633                  68      82.97
16796  2007   33065403                 172      82.04
16668  2005   33062633                  59      81.89
16794  2005   33065403                 177      81.66
10043  2007   29342880                  43      80.32
18121  2007   33152314                  14      79.82
16781  2007   33065250                  80      79.67
3026   2007   22025740                 144      79.52
14636  2007   31311723                 222      79.41
17318  2007   33087679                 210      79.38
The Small Schools Myth
import numpy as np
import seaborn as sns

plot_data = (
    df.assign(top_school=df["avg_score"] >= np.quantile(df["avg_score"], .99))
    [["top_school", "number_of_students"]]
    .query(f"number_of_students<{np.quantile(df['number_of_students'], .98)}")  # remove outliers
)

sns.boxplot(x="top_school", y="number_of_students", data=plot_data)
plt.title("Number of Students of 1% Top Schools (Right)");
The Most Dangerous Equation
Coined by statistician Howard Wainer in 2009.¹
Smaller samples have larger variance in the sample mean.
\sigma_{\bar{x}}^2 = \sigma^2 / n
Gives the variance of \bar{x} in the Central Limit Theorem, where \sigma^2 is the population variance and n is the sample size.
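The equation can be checked with a quick simulation. This is a minimal sketch, assuming normally distributed data with \sigma = 10; the function name `variance_of_sample_mean` and its parameters are made up for illustration. It draws many samples of size n, takes the mean of each, and compares the variance of those means with \sigma^2 / n.

```python
import random
from statistics import fmean, pvariance


def variance_of_sample_mean(n, sigma=10.0, trials=5000, seed=0):
    """Estimate Var(x-bar) from `trials` samples of size n drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    sample_means = [
        fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return pvariance(sample_means)


for n in (4, 25, 100):
    estimate = variance_of_sample_mean(n)
    print(f"n={n:3d}  simulated={estimate:.2f}  sigma^2/n={10.0 ** 2 / n:.2f}")
```

The simulated values should sit close to 25, 4, and 1: the variance of the mean shrinks in proportion to the sample size.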
The Small Schools Myth
q_99 = np.quantile(df["avg_score"], .99)
q_01 = np.quantile(df["avg_score"], .01)

# Label each row as top 1%, bottom 1%, or middle by average score
groups = lambda d: np.select(
    [d["avg_score"] > q_99, d["avg_score"] < q_01],
    ["Top", "Bottom"],
    "Middle",
)

plot_data = df.sample(10000).assign(Group=groups)
sns.scatterplot(y="avg_score", x="number_of_students", hue="Group", data=plot_data)
plt.title("Mean ENEM Score by Number of Students in the School");
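The same pattern appears even when no school is better than any other. The sketch below uses simulated data, not the ENEM file: the school sizes (10 to 300 students) and the score distribution N(50, 10^2) are hypothetical choices for illustration. Every student is drawn from the same distribution, yet small schools should still dominate both the top and the bottom 1%.

```python
import random
from statistics import fmean, median

rng = random.Random(42)

# Hypothetical schools: sizes and scores are simulated, not taken from the ENEM data.
# Every student's score comes from the same N(50, 10^2) distribution.
schools = []
for _ in range(1000):
    n_students = rng.randint(10, 300)
    avg_score = fmean(rng.gauss(50, 10) for _ in range(n_students))
    schools.append((n_students, avg_score))

schools.sort(key=lambda s: s[1])            # sort by average score
bottom, top = schools[:10], schools[-10:]   # bottom 1% and top 1% of 1000 schools

print("median size, all schools:", median(n for n, _ in schools))
print("median size, top 1%:     ", median(n for n, _ in top))
print("median size, bottom 1%:  ", median(n for n, _ in bottom))
```

Because school quality is identical by construction here, any size difference between the groups comes only from the larger variance of small-sample means.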
Conclusion
“Learning by doing, peer-to-peer teaching, and computer simulation are all part of the same equation.” – Nicholas Negroponte, founder of MIT Media Lab and OLPC