Data science etudes -- explorations of statistical concepts through code.
Conformal prediction is the standout recent development: it provides distribution-free prediction intervals with finite-sample coverage guarantees, assuming only that the data are exchangeable -- no parametric model of the data-generating process. Chatterjee's Xi coefficient and distance correlation address a fundamental limitation of Pearson correlation: they detect arbitrary nonlinear associations, which matters for feature selection, independence testing, and understanding complex data relationships. The information-theoretic ML foundations connect Shannon entropy to generalization bounds and PAC-Bayes theory -- a theoretical frontier in understanding why deep learning generalizes.
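The split-conformal recipe behind that coverage guarantee fits in a few lines. Below is a minimal sketch using synthetic data and a least-squares "model" with absolute residuals as conformity scores; all data, names, and numbers here are illustrative, not taken from the notebooks in this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 2x + standard-normal noise.
n = 2000
x = rng.uniform(-3, 3, n)
y = 2 * x + rng.normal(0, 1, n)

# Split the data: fit on one part, calibrate on another, test on the rest.
x_fit, y_fit = x[:1000], y[:1000]
x_cal, y_cal = x[1000:1500], y[1000:1500]
x_test, y_test = x[1500:], y[1500:]

# "Model": ordinary least squares on the fit split.
A = np.column_stack([x_fit, np.ones_like(x_fit)])
slope, intercept = np.linalg.lstsq(A, y_fit, rcond=None)[0]
predict = lambda x: slope * x + intercept

# Conformity scores on the calibration split: absolute residuals.
scores = np.abs(y_cal - predict(x_cal))

# Finite-sample-corrected quantile for 90% target coverage.
alpha = 0.1
n_cal = len(scores)
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal,
                method="higher")

# Intervals [prediction - q, prediction + q] cover the true value with
# probability >= 1 - alpha whenever the data are exchangeable.
covered = np.abs(y_test - predict(x_test)) <= q
print(f"empirical coverage: {covered.mean():.3f}")  # close to 0.90
```

The guarantee is marginal (on average over test points) and does not require the model to be good; a poor model simply yields wider intervals.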
This repository contains Jupyter notebooks that explore fundamental statistical concepts, with a focus on understanding how different measures capture relationships in data.
`pearson_correlation_coefficient_vs_mutual_information.ipynb` - Compares how Pearson correlation and mutual information capture dependence in bivariate normal data, inspired by Nassim Taleb.
This notebook explores the relationship between Pearson correlation and mutual information:
- Pearson Correlation: Shows how the correlation coefficient (rho) scales nonlinearly in information terms -- the dependence gained going from rho=0.5 to rho=0.9 is far larger than from rho=0 to rho=0.5, even though the numeric gaps are similar
- Mutual Information: Demonstrates how mutual information scales linearly with information content, providing a more intuitive measure of dependence
- Comparison: Visualizes bivariate normal distributions across both metrics to illustrate their differences
The notebook is based on this tweet by Nassim Taleb.
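The comparison can be reproduced numerically. For a bivariate normal, mutual information has the closed form I(X; Y) = -1/2 ln(1 - rho^2) nats, so MI at each rho can be computed exactly and checked against sampled data. The sketch below is illustrative and independent of the notebook's code.

```python
import numpy as np

def gaussian_mi(rho):
    """Closed-form mutual information (in nats) of a bivariate normal."""
    return -0.5 * np.log(1 - rho**2)

rng = np.random.default_rng(42)

for rho in (0.0, 0.5, 0.9):
    # Sample correlated pairs and compare empirical Pearson r with exact MI.
    cov = [[1, rho], [rho, 1]]
    x, y = rng.multivariate_normal([0, 0], cov, size=50_000).T
    r = np.corrcoef(x, y)[0, 1]
    print(f"rho={rho:.1f}  empirical r={r:+.3f}  MI={gaussian_mi(rho):.3f} nats")
```

The MI column makes the nonlinearity explicit: the jump from rho=0.5 to rho=0.9 carries several times more information than the jump from rho=0 to rho=0.5.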
- Python 3.7+
- Jupyter
- NumPy
- Matplotlib
Install dependencies:

```shell
pip install jupyter numpy matplotlib
```

Start Jupyter:

```shell
jupyter notebook
```

Open `pearson_correlation_coefficient_vs_mutual_information.ipynb` and run the cells.
See LICENSE file for details.
These resources were recently found and have not been reviewed yet.
- Information Theory chapter -- Dive into Deep Learning (d2l.ai) - Interactive notebook covering entropy, KL divergence, mutual information
- Probabilistic Machine Learning: An Introduction - Kevin Murphy (MIT Press, 2022, free) - 944-page textbook with Colab notebooks covering information theory and Bayesian reasoning
- Kernel Partial Correlation Coefficient (JMLR 2022) - Nonparametric conditional dependence measure extending Pearson/MI comparisons
- hyppo: Multivariate Hypothesis Testing Python Package - Unified library for distance correlation, HSIC, MGC with tutorials
- Causal Inference for the Brave and True - Matheus Facure - Free Python notebook handbook on DAGs, propensity scoring, heterogeneous effects
- pytudes - Peter Norvig - Gold-standard exploratory Python notebooks on probability, puzzles, algorithms
- Probably Overthinking It notebooks - Allen Downey (2023) - Base-rate fallacy, Simpson's paradox, distributional thinking with real datasets
- Semi-Distance Correlation (JASA 2024) - Extends distance correlation to mixed categorical-continuous data
- Copula models in Python using sympy - From-scratch Gaussian and Archimedean copulas, good companion to Pearson/MI comparison
- Probably Overthinking It - Allen Downey (2023) - Data-first statistical reasoning exposing common traps; all code in open notebooks
- Causal Inference in Python - Matheus Facure (O'Reilly, 2023) - DiD, IV, synthetic control applied to industry problems
- Chatterjee's Xi Correlation in SciPy (scipy.stats.chatterjeexi, v1.15, 2024) - Rank-based asymmetric dependence measure detecting non-monotonic associations; now in SciPy
- KDist: Kernel and Distance Methods for Statistical Inference - Zhang (R, 2025) - R package unifying HSIC, distance correlation, energy statistics, change-point detection, and conditional independence tests
- Information-Theoretic Foundations for Machine Learning - Jeon et al. (arXiv, 2024) - 200+ page monograph unifying generalization, meta-learning, and misspecification through Shannon information theory
- Mutual Information Estimation via Normalizing Flows - Butakov et al. (NeurIPS 2024) - Maps data to tractable distribution for closed-form MI estimation with provable error bounds
- Introduction to Conformal Prediction with Python - Christoph Molnar (2023) - Practical book with Jupyter notebooks on distribution-free uncertainty quantification for any ML model
- Conformal Prediction Notebooks - Angelopoulos & Bates (GitHub, 2024) - Hands-on notebooks applying conformal prediction to ImageNet, medical, toxic text, and tumor segmentation
- Think Stats 3rd Edition - Allen Downey (O'Reilly, 2025) - Major rewrite entirely in Jupyter notebooks; EDA, distributions, hypothesis testing, regression
- Statistical Rethinking 2024 Course - Richard McElreath (GitHub, 2024) - Full Bayesian modeling course with video lectures and code in R, Python/PyMC, Julia/Turing
- Statistical Consequences of Fat Tails: Notebooks - MarcosCarreira (GitHub) - Python/R/Mathematica notebooks reproducing every chapter of Taleb's Technical Incerto
- fattails: Python Notebooks for Fat-Tailed Statistics - FergM (GitHub) - CLT failure under fat tails, S&P500 geometric averages, power-law diagnostics
- Julia for Data Analysis - Bogumil Kaminski (Manning, 2023) - Hands-on guide to DataFrames.jl, time series, and predictive models; most current practical Julia data science book
- dcor: Distance Correlation and Energy Statistics in Python - Ramos-Carreno & Torrecilla (SoftwareX, 2023) - Efficient distance correlation, energy distance, and independence tests; natural extension of Pearson/MI comparison
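For a feel of the Chatterjee xi measure listed above, here is a from-scratch sketch of the rank formula for the no-ties case. This is illustrative only; `scipy.stats.chatterjeexi` (SciPy >= 1.15) is the tested implementation, and the variable names and test data below are made up for the example.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's xi coefficient, assuming no ties in y."""
    order = np.argsort(x, kind="stable")            # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1    # ranks of y, in x-order
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)

xi_dep = chatterjee_xi(x, x**2)                    # nonlinear, non-monotonic
xi_ind = chatterjee_xi(x, rng.normal(size=2000))   # independent noise
pearson = np.corrcoef(x, x**2)[0, 1]               # Pearson misses y = x^2

print(f"xi(x, x^2)={xi_dep:.3f}  xi(x, noise)={xi_ind:.3f}  "
      f"pearson(x, x^2)={pearson:.3f}")
```

On the noiseless parabola, xi approaches 1 while Pearson correlation sits near 0, which is exactly the gap these newer dependence measures close.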