STATISTICS SURVIVAL WALL // Andy's Edition // PG 01

CORE FOUNDATIONS & THEORIES

Greek Symbols & Notation

μ    Population Mean
x̄    Sample Mean
σ    Population Std Dev
s    Sample Std Dev
α    Alpha (Type I Error Rate)
β    Beta (Type II Error Rate)
χ²   Chi-Squared Statistic
F    F-Statistic (ANOVA/Reg)
r    Pearson Correlation
R²   Variance Explained
η²   Eta Squared (Effect)
d    Cohen's d (Effect)
n    Sample Size
df   Degrees of Freedom

Descriptive Stats

  • Mean: Average. Sensitive to outliers.
  • Median: Middle value (50th percentile). Robust.
  • Mode: Most frequent value.
  • Variance (s²): Avg squared deviation from mean (sample version divides by n - 1).
  • Std Dev (s): Square root of variance. In original units.
  • IQR: Q3 - Q1. Middle 50% of data.
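A minimal sketch of the measures above using only Python's standard library. The data values are made up, with one deliberate outlier to show how the mean moves while the median stays put.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 40]  # made-up values; 40 is a deliberate outlier

mean = statistics.mean(data)          # pulled upward by the outlier
median = statistics.median(data)      # middle value, barely affected
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance s² (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of s², in original units
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                         # spread of the middle 50%

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"s2={variance:.2f} s={std_dev:.2f} IQR={iqr}")
```

Note that quartile conventions differ between tools, so the IQR from other software may not match exactly.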

Distribution Concepts

  • Population: The entire group (μ, σ).
  • Sample: Subset (x̄, s).
  • Sampling Dist: Dist of a statistic (e.g., the mean) across all possible samples of the same size.

Central Limit Theorem

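As sample size n increases, the sampling distribution of the mean approaches a normal distribution, regardless of the population's shape (provided the population variance is finite). n ≥ 30 is a common rule of thumb; heavily skewed populations may need more.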
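The theorem can be checked with a quick simulation. This sketch assumes an exponential population (mean 1, SD 1, strongly right-skewed) and n = 30; both choices are illustrative, not from the sheet.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Draw n values from an exponential population (mean 1, SD 1, right-skewed).
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

n = 30
means = [sample_mean(n) for _ in range(5000)]

center = statistics.mean(means)   # should sit near the population mean, 1.0
spread = statistics.stdev(means)  # should sit near 1 / sqrt(n), about 0.18

print(f"mean of sample means = {center:.3f}")
print(f"SD of sample means   = {spread:.3f}")
```

A histogram of `means` would look roughly bell-shaped even though the raw population is not.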

Normal Distribution

  • Symmetrical, bell-shaped.
  • 68-95-99.7 Rule: about 68%, 95%, and 99.7% of data fall within 1, 2, and 3 standard deviations of the mean.

Confidence Intervals (CI)

Range of plausible values for the true parameter. "X% confidence" means X% of intervals built this way across repeated samples would contain it.

Level   Z-Crit   Interpretation
90%     1.645    Narrow, higher error
95%     1.960    Standard baseline
99%     2.576    Wide, conservative
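The critical values can be recovered from the standard normal distribution. A small sketch, assuming a z-based interval for a mean (known SD or large n); `z_crit`, `mean_ci`, and the numbers passed to them are illustrative.

```python
from statistics import NormalDist
import math

def z_crit(level):
    # Two-sided critical value: leave (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def mean_ci(x_bar, sd, n, level=0.95):
    # z-based interval for a mean; assumes sd is known or n is large.
    half_width = z_crit(level) * sd / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

for level in (0.90, 0.95, 0.99):
    print(level, round(z_crit(level), 3))

low, high = mean_ci(x_bar=50.0, sd=10.0, n=100)  # illustrative numbers
print(round(low, 2), round(high, 2))
```

For small samples with an estimated SD, a t critical value would replace the z value.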

Alpha, Beta & Power

Decision         H0 True             H0 False
Reject H0        Type I Error (α)    True Pos (Power)
Fail to Reject   True Negative       Type II Error (β)
  • α (Alpha): Prob of rejecting a true H0 (False Positive).
  • β (Beta): Prob of failing to reject false H0 (False Negative).
  • Power (1 - β): Prob of correctly rejecting a false H0.
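To see how α, β, and power relate, here is a rough power calculation. It assumes a two-sided, two-sample z-test with equal group sizes, which is a normal approximation and not the exact t-test power; the function name is hypothetical.

```python
from statistics import NormalDist
import math

def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the true standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

no_effect = power_two_sample_z(0.0, 64)      # H0 true: rejection rate equals alpha
medium_effect = power_two_sample_z(0.5, 64)  # d = 0.5 with 64 per group

print(round(no_effect, 3), round(medium_effect, 3))
```

β is simply 1 minus the second number: the chance of missing a real effect of that size.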

Hypothesis Workflow

1. Formulate Question
2. State H0 & HA
3. Choose Alpha (α)
4. Check Assumptions
5. Choose & Run Test
6. Interpret p-value
7. Check Effect Size

Anatomy of a Statistical Output

Most software prints a cryptic block. Here is how to decode the matrix.

> t.test(Group_A, Group_B, var.equal=FALSE)

    Welch Two Sample t-test

data: Group_A and Group_B
t = 2.841, df = 47.3, p-value = 0.0067
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.3225   7.1031
sample estimates:
mean of x  mean of y
 126.33     122.11
t (Test Statistic)
The calculated signal-to-noise ratio. Larger absolute value = stronger signal.
df (Degrees of Freedom)
Sample size adjusted for parameters. With Welch, this gets fractional (47.3).
p-value (Decision)
Probability of seeing data at least this extreme, assuming H0 is true. If p < α (e.g. 0.05), Reject H0.
CI (Plausible Range)
If the 95% CI does NOT contain 0, the difference is statistically significant at α = 0.05.
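To make the t and df lines less mysterious, this sketch computes both by hand using the standard Welch formulas. The two groups are made-up data, not the ones behind the output above.

```python
import math
import statistics

def welch_t(a, b):
    # Returns (t, df) for Welch's unequal-variance t-test.
    va = statistics.variance(a) / len(a)  # squared standard error, group a
    vb = statistics.variance(b) / len(b)  # squared standard error, group b
    # Signal (difference in means) divided by noise (combined standard error).
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

group_a = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]  # made-up data
group_b = [4.2, 4.8, 4.1, 5.0, 4.4]
t, df = welch_t(group_a, group_b)
print(f"t = {t:.3f}, df = {df:.1f}")
```

The p-value then comes from a t distribution with that fractional df, which is the step software handles for you.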

Statistical Mantras

STANDARD RULE:

If the p is low,
the null (H0) must go!

MASTER CUSTOM RULE:

If the p is high,
the 1 (H1) must die! 🍌
  • Fail to reject H0 ≠ Accept H0. (Lack of evidence is not proof of absence).
  • Significant ≠ Important. (Check Effect Size!)
  • Correlation ≠ Causation.

TEST SELECTION PROTOCOL & CORE TESTS

Test Selection Terminal

WHAT IS YOUR OUTCOME VARIABLE?
│
├── NUMERICAL (Continuous/Interval)
│   │
│   ├── Compare to a fixed target value ─────────────> One Sample t-Test
│   │
│   ├── Compare groups (Categorical Predictor)
│   │   │
│   │   ├── 2 Independent Groups ────────────────────> Welch Two-Sample t
│   │   │
│   │   ├── 2 Related/Paired Groups (Before/After) ──> Paired t-Test
│   │   │
│   │   └── 3+ Independent Groups ───────────────────> One-Way ANOVA
│   │
│   └── Relationship (Numerical Predictor)
│       │
│       ├── Linear Relationship ─────────────────────> Pearson Correlation
│       └── Monotonic (Ranked) Relationship ─────────> Spearman Correlation
│
└── CATEGORICAL (Nominal/Ordinal)
    │
    ├── Proportion (1 Variable, Binary Outcome) ─────> Binomial Test
    │
    ├── Frequencies fit expected distribution? ──────> Chi² Goodness of Fit
    │
    └── Association between 2 Variables ─────────────> Chi² Independence
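The same tree can be written as a lookup function. `choose_test` and its argument values ("target", "groups", "fit", and so on) are hypothetical names invented for this sketch.

```python
def choose_test(outcome, goal, groups=None, paired=False):
    """Hypothetical helper that mirrors the decision tree above."""
    if outcome == "numerical":
        if goal == "target":
            return "One Sample t-Test"
        if goal == "groups":
            if groups == 2:
                return "Paired t-Test" if paired else "Welch Two-Sample t"
            if groups is not None and groups >= 3:
                return "One-Way ANOVA"
        if goal == "linear":
            return "Pearson Correlation"
        if goal == "monotonic":
            return "Spearman Correlation"
    if outcome == "categorical":
        table = {
            "proportion": "Binomial Test",
            "fit": "Chi² Goodness of Fit",
            "association": "Chi² Independence",
        }
        if goal in table:
            return table[goal]
    # Anything the tree does not cover is rejected rather than guessed.
    raise ValueError("combination not covered by the tree")

print(choose_test("numerical", "groups", groups=2, paired=True))
```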

Assumption Failure Matrix

Parametric tests typically assume approximately normal data (or residuals), and the classic t-test and ANOVA also assume equal variance across groups. If assumptions fail, use non-parametric (rank-based) equivalents.

Wanted Test     Broken Assumption         Use Instead
Indep. t-Test   Unequal Variance          Welch t-Test
Indep. t-Test   Non-Normal (Outliers)     Mann-Whitney U
Paired t-Test   Non-Normal                Wilcoxon Signed Rank
ANOVA           Non-Normal                Kruskal-Wallis
Pearson (r)     Non-Normal / Non-Linear   Spearman (ρ)

1. Welch Two-Sample t

H0: μ1 = μ2.
Compares means of 2 independent groups. Doesn't assume equal variance (Standard!).

Python
from scipy import stats
stats.ttest_ind(a, b, equal_var=False)
R
t.test(val ~ grp, data=df, var.equal=FALSE)

2. Paired t-Test

H0: μ_diff = 0.
Compares means from same group at different times (Before vs After).

Python
stats.ttest_rel(before, after)
R
t.test(before, after, paired=TRUE)

3. One-Way ANOVA

H0: μ1 = μ2 = μ3...
Compares means of 3+ groups. If p<0.05, at least one group differs. (Needs Post-Hoc!).

Python
stats.f_oneway(g1, g2, g3)
R
summary(aov(val ~ grp, data=df))

4. Chi-Square Indep.

H0: Variables are independent.
Tests association between 2 categorical variables (e.g. Pet Type x City).

Python
from scipy.stats import chi2_contingency
chi2_contingency(pd.DataFrame(obs))  # obs: table of observed counts
R
chisq.test(table_data)
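For intuition, the statistic is just observed versus expected counts. A hand-rolled sketch with made-up counts; it applies no continuity correction, whereas SciPy's `chi2_contingency` applies Yates' correction by default on 2x2 tables, so its statistic will be smaller unless `correction=False` is passed.

```python
def chi_square_independence(observed):
    # observed: list of rows of counts (a contingency table).
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # count under H0
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

chi2, df = chi_square_independence([[10, 20], [20, 10]])  # made-up counts
print(round(chi2, 3), df)
```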

5. Pearson Correlation

H0: population correlation = 0.
Measures linear strength between 2 continuous variables (-1 to 1).

Python
stats.pearsonr(df['x'], df['y'])
R
cor.test(df$x, df$y)
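Under the hood, r is the co-deviation of x and y scaled by their individual spreads. A sketch from the textbook formula, using made-up data:

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # how x and y move together
    sxx = sum((a - mx) ** 2 for a in x)                   # spread of x
    syy = sum((b - my) ** 2 for b in y)                   # spread of y
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```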

6. Mann-Whitney U

H0: Distributions equal.
Non-parametric alternative to independent t-test (compares ranks, not means).

Python
stats.mannwhitneyu(grp_a, grp_b)
R
wilcox.test(x, y, paired=FALSE)
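"Compares ranks, not means" can be made concrete: U counts how often a value from one group beats a value from the other. A brute-force sketch (fine for small samples) with made-up data; it returns only the statistic, not a p-value.

```python
def mann_whitney_u(a, b):
    # U for sample a: count pairs where a beats b; ties count as half.
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

a = [1, 2, 3]  # made-up data
b = [2, 3, 4, 5]
u_a = mann_whitney_u(a, b)
u_b = mann_whitney_u(b, a)
print(u_a, u_b)  # the two always sum to len(a) * len(b)
```

A U near 0 or near len(a) * len(b) means one group almost always ranks higher than the other.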

7. Wilcoxon Signed Rank

H0: Median diff = 0.
Non-parametric alternative to paired t-test.

Python
stats.wilcoxon(before, after)
R
wilcox.test(before, after, paired=TRUE)

8. Kruskal-Wallis

H0: All groups come from the same distribution (often read as equal medians).
Non-parametric alternative to ANOVA (3+ groups).

Python
stats.kruskal(g1, g2, g3)
R
kruskal.test(val ~ grp, data=df)

REGRESSION, EFFECT SIZES & REPORTING

Linear & Multiple Regression

y = β0 + β1x1 + β2x2 + ... + ε

Models the relationship between predictors (x) and a continuous outcome (y).

Regression Decoder:
Intercept (β0): Predicted y when all x = 0.
Coefficient (Slope β1): Expected change in y for a 1-unit increase in x, holding other x constant.
Residual (ε): Vertical distance from a data point to the regression line (observed y minus predicted y).
R² (R-squared): % of variance in y explained by the model.
Adjusted R²: R² penalized for adding useless predictors (always use this for Multiple Reg).
Assumptions & Diagnostics:
  • Linearity: Plot residuals vs fitted.
  • Homoscedasticity: Equal variance of residuals.
  • Normality: Q-Q Plot of residuals.
  • Multicollinearity: Check VIF (Variance Inflation Factor). If VIF > 5, predictors are highly correlated.
Python
import statsmodels.formula.api as smf
smf.ols("y ~ x1 + x2", data=df).fit().summary()
R
summary(lm(y ~ x1 + x2, data=df))
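For a single predictor the decoder terms can be computed by hand. A sketch with made-up data; `fit_line` and `adjusted_r2` are illustrative helpers, and real multi-predictor models should go through statsmodels or `lm` as above.

```python
def fit_line(x, y):
    # Least-squares fit for one predictor: y = b0 + b1 * x.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx       # slope: change in y per 1-unit increase in x
    b0 = my - b1 * mx    # intercept: predicted y at x = 0
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)   # unexplained variation
    ss_tot = sum((yi - my) ** 2 for yi in y)  # total variation
    return b0, b1, 1 - ss_res / ss_tot

def adjusted_r2(r2, n, p):
    # Penalize R² for the number of predictors p given n observations.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
b0, b1, r2 = fit_line(x, y)
print(round(b0, 2), round(b1, 2), round(r2, 2), round(adjusted_r2(r2, len(x), 1), 3))
```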

Logistic Regression

Predicts a Binary outcome (0/1, Yes/No).

  • Odds: P(Event) / P(No Event).
  • Logit: Natural log of the odds. The model runs linearly on logits.
  • Odds Ratio (OR): e^(Coef), the multiplicative change in the odds for a 1-unit increase in x.
    OR > 1 → Increases the odds
    OR = 1 → No effect
    OR < 1 → Decreases the odds
Python
smf.logit("y ~ x", data=df).fit()
R
glm(y ~ x, family=binomial, data=df)
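The conversions between probability, odds, logit, and odds ratio are one-liners. A sketch with hypothetical helper names; the coefficient value is illustrative, not from a fitted model.

```python
import math

def odds(p):
    return p / (1 - p)             # P(Event) / P(No Event)

def logit(p):
    return math.log(odds(p))       # the scale the model is linear on

def inv_logit(z):
    return 1 / (1 + math.exp(-z))  # back from logit to probability

def odds_ratio(coef):
    return math.exp(coef)          # e^(Coef)

print(round(odds(0.8), 3))        # a probability of 0.8 is odds of 4 to 1
print(logit(0.5))                 # even odds sit at logit 0
print(round(odds_ratio(0.7), 3))  # illustrative coefficient of 0.7
```

A coefficient of 0 gives OR = 1, which is the "no effect" row above.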

Post-Hoc Analysis

ANOVA answers: "Is there ANY difference?"
Tukey answers: "WHICH groups differ?"
  • Tukey HSD: Compares all pairs safely.
  • Bonferroni: p_new = p * (num_tests), capped at 1. Same as testing each raw p against α / num_tests. Strict!
Python (Tukey)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
pairwise_tukeyhsd(df['y'], df['grp'])
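Bonferroni is simple enough to do by hand. A sketch with made-up p-values; the function name is hypothetical.

```python
def bonferroni(p_values):
    # Multiply each p by the number of tests and cap the result at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.01, 0.04, 0.5]  # made-up p-values from three pairwise tests
adjusted = bonferroni(raw)
print([round(p, 3) for p in adjusted])
```

Here the second comparison passes at 0.05 on its raw p-value but no longer does after adjustment.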

Effect Sizes (Practical Significance)

STATISTICAL SIGNIFICANCE ≠ PRACTICAL SIGNIFICANCE

A p-value only tells you how surprising the data would be if H0 were true; it says nothing about how large the effect is. Effect size tells you if you should care.

Cohen's d (t-Test)   0.2 = Small    0.5 = Medium     0.8 = Large
Pearson r (Corr)     0.1 = Weak     0.3 = Moderate   0.5 = Strong
Eta² η² (ANOVA)      0.01 = Small   0.06 = Medium    0.14 = Large
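Cohen's d is the difference in means divided by a pooled standard deviation. A sketch assuming two independent groups and the pooled-SD version of d; the data are made up, and the "negligible" label for values under 0.2 is an addition for this example, not part of the benchmarks above.

```python
import math
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

def label_d(d):
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # label assumed for this sketch

a = [2, 4, 6, 8]  # made-up data
b = [1, 3, 5, 7]
d = cohens_d(a, b)
print(round(d, 3), label_d(d))
```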

Reporting Checklist

Always report these elements in APA format:

  • Test used
  • Test statistic (t, F, χ²)
  • Degrees of freedom (df)
  • Exact p-value (p = .034; APA drops the leading zero)
  • Effect size (d, η², OR)
  • Confidence interval (CI)

Python ↔ R Translation

Concept    Python          R
Mean       df.mean()       mean(x)
SD         df.std()        sd(x)
Summary    df.describe()   summary(x)
Crosstab   pd.crosstab()   table()
Plot       seaborn         ggplot2

MACHINE LEARNING & MODEL EVALUATION

Data Splitting & Validation

Never test a model on the data it trained on. The score looks better than it really is and hides overfitting (a form of data leakage).

  • Train Set (70-80%): Used by the algorithm to learn patterns.
  • Test Set (20-30%): Unseen data to evaluate final performance.
  • Validation Set: Used to tune hyperparameters.

k-Fold Cross Validation

Splits data into 'k' chunks. Trains on k-1, tests on the 1 remaining. Repeats k times. Averages the scores. More robust than a single split.

Python (Scikit-Learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for a repeatable split
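The k-fold idea itself fits in a few lines. This is a plain-Python sketch of the index bookkeeping only, with a hypothetical function name; it does not shuffle, so in practice shuffle the data first or use a library splitter.

```python
def k_fold_indices(n_samples, k):
    # Yield (train_idx, test_idx) pairs; each sample is tested exactly once.
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Train on `train_idx`, score on `test_idx`, then average the k scores.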

Classification: The Confusion Matrix

Compares Actual Truth vs. Model Prediction for binary classification.

                 Predicted Condition
                 Pred Positive (1)                    Pred Negative (0)
Actual Pos (1)   True Positive (TP)                   False Negative (FN) - Type II Error
Actual Neg (0)   False Positive (FP) - Type I Error   True Negative (TN)
Imbalanced Data Warning:

If 99% of emails are normal and 1% is spam, a model predicting "normal" every time has 99% accuracy but is completely useless. Do not use Accuracy for imbalanced data!

Fixes: SMOTE (Oversampling), Undersampling, Class Weights.

Evaluation Metrics

  • Accuracy: (TP+TN) / Total
    Overall correctness (Bad for imbalanced)
  • Precision: TP / (TP + FP)
    When it predicts Yes, how often is it right? (Minimize False Positives)
  • Recall (Sensitivity): TP / (TP + FN)
    Out of all actual Yes, how many did we find? (Minimize False Negatives)
  • F1-Score: 2 * (Prec * Rec) / (Prec + Rec)
    Harmonic mean of Precision and Recall.
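The four formulas, applied to the spam scenario from the imbalanced data warning: 1,000 emails, 10 of them spam, and a model that always predicts "normal". Returning 0 when a denominator is 0 is a convention chosen for this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Convention for this sketch: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Always predicting "normal": no positives predicted, all 10 spam emails missed.
always_normal = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_normal)
```

Accuracy comes out at 0.99 while recall and F1 are 0, which is exactly why accuracy misleads on imbalanced data.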

ROC Curve & AUC

ROC (Receiver Operating Characteristic): Plots True Positive Rate (Recall) vs False Positive Rate across different probability thresholds.

AUC (Area Under Curve): A single metric evaluating model performance independent of threshold.

  • AUC = 0.5: Useless (Random guessing).
  • AUC = 0.7 - 0.8: Acceptable.
  • AUC = 0.8 - 0.9: Excellent.
  • AUC = 1.0: Perfect (Suspicious!).
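One way to read AUC: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. A brute-force sketch of that pairwise definition with made-up labels and scores; fine for small data, too slow for large sets.

```python
def auc_from_scores(labels, scores):
    # Share of (positive, negative) pairs where the positive scores higher;
    # tied scores count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]          # made-up ground truth
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
auc = auc_from_scores(labels, scores)
print(auc)
```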

⚠️ MACHINE LEARNING TRAPS

  • TRAP Data Leakage: Scaling or imputing missing values BEFORE train/test split. Always fit scaler on Train only!
  • TRAP Overfitting: 99% train accuracy but 60% test accuracy. The model memorized the noise.
  • TRAP Thresholding: `.predict()` defaults to 0.5 threshold. You might need to change it via `.predict_proba()` to optimize Recall vs Precision.
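The thresholding trap in miniature. Here `proba` is a plain list standing in for the positive-class column a classifier's `.predict_proba()` would return; the values, labels, and helper names are made up for illustration.

```python
def predict_at(proba, threshold):
    # Label as positive when the predicted probability reaches the threshold.
    return [1 if p >= threshold else 0 for p in proba]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]                      # made-up labels
proba = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1, 0.55, 0.35]   # made-up probabilities

default = precision_recall(y_true, predict_at(proba, 0.5))
lowered = precision_recall(y_true, predict_at(proba, 0.3))
print("threshold 0.5:", default)
print("threshold 0.3:", lowered)
```

Lowering the threshold catches more actual positives (higher recall) and lets in more false positives; choose the threshold on validation data, not the test set.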