STATISTICS SURVIVAL WALL // Andy's Edition // PG 01

CORE FOUNDATIONS & THEORIES

Greek Symbols & Notation

μ	Population Mean
x̄	Sample Mean
σ	Population Std Dev
s	Sample Std Dev
α	Alpha (Type I Error Rate)
β	Beta (Type II Error Rate)
χ²	Chi-Squared Statistic
F	F-Statistic (ANOVA/Reg)
r	Pearson Correlation
R²	Variance Explained
η²	Eta Squared (Effect)
d	Cohen's d (Effect)
n	Sample Size
df	Degrees of Freedom

Descriptive Stats

Mean: Average. Sensitive to outliers.
Median: Middle value (50th percentile). Robust.
Mode: Most frequent value.
Variance (s²): Avg squared deviation from the mean (the sample version divides by n - 1, not n).
Std Dev (s): Square root of variance. In original units.
IQR: Q3 - Q1. Middle 50% of data.
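The measures above can be sketched with Python's standard `statistics` module alone. The `scores` list is made-up illustration data, and `statistics.quantiles` is left on its default "exclusive" method, so other tools may report slightly different quartiles.

```python
import statistics as st

# Hypothetical data, invented for illustration only.
scores = [2, 4, 4, 5, 7, 9, 11]

mean = st.mean(scores)            # arithmetic average
median = st.median(scores)        # middle value of the sorted data
mode = st.mode(scores)            # most frequent value
variance = st.variance(scores)    # sample variance (divides by n - 1)
std_dev = st.stdev(scores)        # square root of the sample variance

# Quartiles via the default "exclusive" method; IQR = Q3 - Q1.
q1, _, q3 = st.quantiles(scores, n=4)
iqr = q3 - q1

# Swap the largest value for an outlier: the mean jumps, the median stays put.
with_outlier = scores[:-1] + [110]
mean_out = st.mean(with_outlier)
median_out = st.median(with_outlier)

print(mean, median, mode, variance, round(std_dev, 3), iqr)
print(round(mean_out, 2), median_out)
```

The second print line is the "sensitive to outliers" vs "robust" contrast in action: one extreme value moves the mean a long way while the median does not change.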

Distribution Concepts

Population: The entire group (μ, σ).
Sample: Subset (x̄, s).
Sampling Dist: Dist of a statistic (e.g., the mean) across repeated samples of the same size.

Central Limit Theorem

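As sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the population's shape (provided the population variance is finite). Rule of thumb: n ≥ 30 is usually enough, but heavily skewed populations need more.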
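A small simulation can make the theorem concrete. This is a sketch under stated assumptions: the exponential population, the seed, and the number of repeated samples are arbitrary choices made for illustration.

```python
import random
import statistics as st

random.seed(0)  # fixed seed so the run is repeatable

N_PER_SAMPLE = 30   # sample size n (the rule-of-thumb threshold)
N_SAMPLES = 5000    # number of repeated samples (an arbitrary choice)

def sample_mean(n):
    """Mean of n draws from Exponential(rate=1), a strongly right-skewed population."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

sample_means = [sample_mean(N_PER_SAMPLE) for _ in range(N_SAMPLES)]

center = st.fmean(sample_means)    # should sit near the population mean (1.0)
spread = st.pstdev(sample_means)   # should sit near sigma / sqrt(n)
expected_se = 1.0 / N_PER_SAMPLE ** 0.5

print(f"mean of sample means: {center:.3f}")
print(f"SD of sample means:   {spread:.3f} (theory: {expected_se:.3f})")
```

Even though individual draws are far from bell-shaped, the sample means cluster around the population mean with a spread close to σ/√n, which is what the CLT predicts.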

Normal Distribution

Symmetrical, bell-shaped.
68-95-99.7 Rule: About 68%, 95%, and 99.7% of data fall within 1, 2, and 3 standard deviations of the mean.
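The rule can be checked directly from the normal CDF. A minimal sketch using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Share of a normal distribution within k standard deviations of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```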

Confidence Intervals (CI)

Range of plausible values for the true parameter. At X% confidence, X% of intervals built this way (across repeated samples) would contain it.

Level	Z-Crit	Interpretation
90%	1.645	Narrow, higher error
95%	1.960	Standard baseline
99%	2.576	Wide, conservative
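The Z-Crit column can be reproduced from the inverse normal CDF, and then used for a normal-approximation interval around a mean. This is a sketch: `z_critical` and `mean_ci` are hypothetical helper names, the data is invented, and for small samples a t critical value would normally replace z.

```python
from statistics import NormalDist, mean, stdev

def z_critical(level):
    """Two-sided critical z for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def mean_ci(data, level=0.95):
    """Normal-approximation CI for the mean: x̄ ± z * s / sqrt(n)."""
    n = len(data)
    margin = z_critical(level) * stdev(data) / n ** 0.5
    center = mean(data)
    return center - margin, center + margin

# Hypothetical measurements, invented for illustration only.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]

for level in (0.90, 0.95, 0.99):
    low, high = mean_ci(data, level)
    print(f"{level:.0%}: z = {z_critical(level):.3f}, CI = ({low:.3f}, {high:.3f})")
```

Higher confidence means a larger critical value and therefore a wider interval, matching the "Narrow" to "Wide" column in the table.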

Alpha, Beta & Power

Decision	H0 True	H0 False
Reject H0	Type I Error (α)	True Pos (Power)
Fail to Reject H0	True Negative	Type II Error (β)

α (Alpha): Prob of rejecting a true H0 (False Positive).
β (Beta): Prob of failing to reject false H0 (False Negative).
Power (1 - β): Prob of correctly rejecting a false H0.
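To see how α, β, and power fit together, here is a sketch for the simplest case: a two-sided one-sample z-test with known σ. The function name `z_test_power` and the effect, σ, and n values are assumptions chosen for illustration; t-tests need the noncentral t distribution, which the standard library does not provide.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test with known sigma."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sigma   # true mean shift in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

alpha = 0.05
power = z_test_power(effect=0.5, sigma=1.0, n=32, alpha=alpha)
beta = 1 - power   # Type II error rate

# With no true effect, the rejection probability is exactly alpha (Type I rate).
false_positive_rate = z_test_power(effect=0.0, sigma=1.0, n=32, alpha=alpha)

print(f"power = {power:.3f}, beta = {beta:.3f}, "
      f"false positive rate = {false_positive_rate:.3f}")
```

Raising n or the true effect increases power and shrinks β, while α stays fixed at whatever was chosen up front.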

Hypothesis Workflow

1. Formulate Question
↓
2. State H0 & HA
↓
3. Choose Alpha (α)
↓
4. Check Assumptions
↓
5. Choose & Run Test
↓
6. Interpret p-value
↓
7. Check Effect Size

Anatomy of a Statistical Output

Most software prints a cryptic block. Here is how to decode the matrix.

> t.test(Group_A, Group_B, var.equal=FALSE)

    Welch Two Sample t-test

data: Group_A and Group_B
t = 2.841, df = 47.3, p-value = 0.0067
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.3225   7.1031
sample estimates:
mean of x  mean of y
126.33     122.11

t (Test Statistic)
The calculated signal-to-noise ratio. Larger absolute value = stronger signal.

df (Degrees of Freedom)
Sample size adjusted for parameters. With Welch, this gets fractional (47.3).

p-value (Decision)
Probability of data at least as extreme as what was observed, assuming H0 is true. If p < α (e.g. 0.05), Reject H0.

CI (Plausible Range)
If the 95% CI for the difference does NOT include 0, the difference is statistically significant at α = 0.05.
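The t and df entries can be computed by hand, which shows why Welch's df is fractional. A minimal sketch, assuming made-up groups; `welch_t` is a hypothetical helper, and the p-value is omitted because it needs the t distribution's CDF (not in the standard library).

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5           # signal / noise
    # Welch-Satterthwaite approximation: usually a non-integer df.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical groups, invented for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The numerator is the difference in means (signal) and the denominator is its standard error (noise), so a larger |t| means the difference stands out more clearly from sampling variation.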

Statistical Mantras

STANDARD RULE:

If the p is low,
the null (H0) must go!

MASTER CUSTOM RULE:

If the p is high,
the 1 (H1) must die! 🍌

Fail to reject H0 ≠ Accept H0. (Lack of evidence is not proof of absence).
Significant ≠ Important. (Check Effect Size!)
Correlation ≠ Causation.

TEST SELECTION PROTOCOL & CORE TESTS

Test Selection Terminal

WHAT IS YOUR OUTCOME VARIABLE?
│
├── NUMERICAL (Continuous/Interval)
│   ├── Compare to a fixed target value ──────────────────> One Sample t-Test
│   ├── Compare groups (Categorical Predictor)
│   │   ├── 2 Independent Groups ─────────────────────────> Welch Two-Sample t
│   │   ├── 2 Related/Paired Groups (Before/After) ───────> Paired t-Test
│   │   └── 3+ Independent Groups ────────────────────────> One-Way ANOVA
│   └── Relationship (Numerical Predictor)
│       ├── Linear Relationship ──────────────────────────> Pearson Correlation
│       └── Monotonic (Ranked) Relationship ──────────────> Spearman Correlation
│
└── CATEGORICAL (Nominal/Ordinal)
    ├── Proportion (1 Variable, Binary Outcome) ──────────> Binomial Test
    ├── Frequencies fit expected distribution? ───────────> Chi² Goodness of Fit
    └── Association between 2 Variables ──────────────────> Chi² Independence

Assumption Failure Matrix

Parametric tests assume approximately normal data (or residuals), and the classic t-test and ANOVA also assume equal variances. If assumptions fail, use the alternatives below (mostly rank-based non-parametric tests).

Wanted Test	Broken Assumption	Use Instead
Indep. t-Test	Unequal Variance	Welch t-Test
Indep. t-Test	Non-Normal (Outliers)	Mann-Whitney U
Paired t-Test	Non-Normal	Wilcoxon Signed Rank
ANOVA	Non-Normal	Kruskal-Wallis
Pearson (r)	Non-Normal / Non-Linear	Spearman (ρ)

1. Welch Two-Sample t

H0: μ1 = μ2.
Compares means of 2 independent groups. Doesn't assume equal variance (Standard!).

Python
stats.ttest_ind(a, b, equal_var=False)
R
t.test(val ~ grp, var.equal=FALSE)

2. Paired t-Test

H0: μ_diff = 0.
Compares means from same group at different times (Before vs After).

Python
stats.ttest_rel(before, after)
R
t.test(before, after, paired=TRUE)

3. One-Way ANOVA

H0: μ1 = μ2 = μ3...
Compares means of 3+ groups. If p<0.05, at least one group differs. (Needs Post-Hoc!).

Python
stats.f_oneway(g1, g2, g3)
R
summary(aov(val ~ grp, data=df))

4. Chi-Square Indep.

H0: Variables are independent.
Tests association between 2 categorical variables (e.g. Pet Type x City).

Python
stats.chi2_contingency(pd.DataFrame(obs))
R
chisq.test(table_data)
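What the test computes can be sketched by hand: expected counts under independence, then the sum of (observed - expected)² / expected. The function name and the counts below are hypothetical. Note that this sketch applies no continuity correction, whereas SciPy's `chi2_contingency` applies Yates' correction to 2x2 tables by default, so its statistic would come out smaller for this table.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical counts (rows: pet type, columns: city), invented for illustration.
observed = [[10, 20],
            [20, 10]]

chi2, df, expected = chi_square_independence(observed)
print(f"chi2 = {chi2:.3f}, df = {df}")
```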
            

5. Pearson Correlation

H0: ρ = 0 (no linear correlation in the population).
Measures linear strength between 2 continuous variables (-1 to 1).

Python
stats.pearsonr(df['x'], df['y'])
R
cor.test(df$x, df$y)
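Pearson measures linear strength only, so a perfectly monotonic but curved relationship scores below 1, while Spearman (Pearson on the ranks) scores exactly 1. A sketch with hypothetical helper names; the `ranks` helper assumes no tied values.

```python
def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks 1..n; assumes no tied values (ties would need average ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Monotonic but non-linear relationship: y = x cubed.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```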
            

6. Mann-Whitney U

H0: Distributions equal.
Non-parametric alternative to independent t-test (compares ranks, not means).

Python
stats.mannwhitneyu(grp_a, grp_b)
R
wilcox.test(x, y, paired=FALSE)

7. Wilcoxon Signed Rank

H0: Median diff = 0.
Non-parametric alternative to paired t-test.

Python
stats.wilcoxon(before, after)
R
wilcox.test(before, after, paired=TRUE)

8. Kruskal-Wallis

H0: Distributions equal (medians equal if the groups share the same shape).
Non-parametric alternative to ANOVA (3+ groups).

Python
stats.kruskal(g1, g2, g3)
R
kruskal.test(val ~ grp, data=df)

REGRESSION, EFFECT SIZES & REPORTING

Linear & Multiple Regression

y = β0 + β1x1 + β2x2 + ... + ε

Models the relationship between predictors (x) and a continuous outcome (y).

Regression Decoder:
Intercept (β0): Predicted y when all x = 0.
Coefficient (Slope β1): Expected change in y for a 1-unit increase in x, holding other x constant.
Residual (ε): Vertical distance from a data point to the regression line (observed y - predicted y).
R² (R-squared): % of variance in y explained by the model.
Adjusted R²: R² penalized for adding useless predictors (always use this for Multiple Reg).
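For a single predictor, every item in the decoder can be computed by hand. This sketch fits y = β0 + β1x by least squares; `fit_line` is a hypothetical helper and the data is invented. Multiple regression needs matrix algebra or a library such as statsmodels.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r_squared, residuals).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx                 # slope: change in y per 1-unit increase in x
    b0 = my - b1 * mx              # intercept: predicted y at x = 0
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]  # observed - predicted
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot, residuals

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, r_squared, residuals = fit_line(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, R^2 = {r_squared:.4f}")
```

R² here is 1 minus the ratio of residual variation to total variation, which is the "% of variance explained" reading.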

Assumptions & Diagnostics:

Linearity: Plot residuals vs fitted.
Homoscedasticity: Equal variance of residuals.
Normality: Q-Q Plot of residuals.
Multicollinearity: Check VIF (Variance Inflation Factor). If VIF > 5, predictors are highly correlated.

Python
import statsmodels.formula.api as smf
smf.ols("y ~ x1 + x2", data=df).fit().summary()
R
summary(lm(y ~ x1 + x2, data=df))

Logistic Regression

Predicts a Binary outcome (0/1, Yes/No).

Odds: P(Event) / P(No Event).
Logit: Natural log of the odds. The model runs linearly on logits.
Odds Ratio (OR): e^(Coef).
OR > 1 → Increases odds of the event
OR = 1 → No effect
OR < 1 → Decreases odds of the event
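The three definitions chain together: probability to odds, odds to logit, and coefficient to odds ratio. A minimal sketch using only `math`; the probability and the coefficient value are hypothetical.

```python
import math

def odds(p):
    """Odds: P(Event) / P(No Event)."""
    return p / (1 - p)

def logit(p):
    """Log-odds, the scale on which logistic regression is linear."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

p_event = 0.8
print(f"odds = {odds(p_event):.2f}, logit = {logit(p_event):.3f}")

# Hypothetical fitted coefficient for one predictor.
coef = 0.693
odds_ratio = math.exp(coef)  # multiplicative change in odds per 1-unit increase in x
print(f"odds ratio = {odds_ratio:.2f}")
```

A coefficient near 0.693 gives an odds ratio near 2: each 1-unit increase in x roughly doubles the odds, which is not the same as doubling the probability.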

Python
smf.logit("y ~ x", data=df).fit()
R
glm(y ~ x, family=binomial, data=df)

Post-Hoc Analysis

ANOVA answers: "Is there ANY difference?"
Tukey answers: "WHICH groups differ?"

Tukey HSD: Compares all pairs safely.
Bonferroni: p_new = p * (num_tests), capped at 1. Equivalent to testing each raw p against α / num_tests. Strict!

Python (Tukey)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
pairwise_tukeyhsd(df['y'], df['grp'])
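The Bonferroni adjustment is simple enough to write out. A sketch with a hypothetical `bonferroni` helper and invented raw p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) under Bonferroni correction."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]   # p * num_tests, capped at 1
    reject = [p_adj < alpha for p_adj in adjusted]
    return adjusted, reject

# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.010, 0.040, 0.500]
adjusted, reject = bonferroni(raw)

for p, p_adj, r in zip(raw, adjusted, reject):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject H0: {r}")
```

The second comparison is significant on its own (0.040 < 0.05) but not after correction, which is the "Strict!" part.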

Effect Sizes (Practical Significance)

STATISTICAL SIGNIFICANCE ≠ PRACTICAL SIGNIFICANCE

A p-value only tells you how surprising the data would be if there were no effect. Effect size tells you whether the effect is big enough to care about.

Measure	Small	Medium	Large
Cohen's d (t-Test)	0.2	0.5	0.8
Pearson r (Corr)	0.1 (Weak)	0.3 (Moderate)	0.5 (Strong)
Eta² η² (ANOVA)	0.01	0.06	0.14
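Cohen's d is the difference in means divided by the pooled standard deviation. A sketch with hypothetical helper names and invented data; the "negligible" label for |d| below 0.2 is an assumption, since the benchmarks above only name small, medium, and large.

```python
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def label_d(d):
    """Map |d| onto the 0.2 / 0.5 / 0.8 benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # assumed label, not part of the benchmark list

# Hypothetical groups, invented for illustration only.
control = [1, 2, 3, 4, 5]
treated = [3, 4, 5, 6, 7]

d = cohens_d(control, treated)
print(f"d = {d:.3f} ({label_d(d)})")
```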

Reporting Checklist

Always report these elements in APA format:

Test used
Test statistic (t, F, χ²)
Degrees of freedom (df)
Exact p-value (p = .034; APA drops the leading zero)
Effect size (d, η², OR)
Confidence interval (CI)

Python ↔ R Translation

Concept	Python	R
Mean	`df.mean()`	`mean(x)`
SD	`df.std()`	`sd(x)`
Summary	`df.describe()`	`summary(x)`
Crosstab	`pd.crosstab()`	`table()`
Plot	`seaborn`	`ggplot2`

MACHINE LEARNING & MODEL EVALUATION

Data Splitting & Validation

Never test a model on the data it trained on (Data Leakage / Overfitting).

Train Set (70-80%): Used by the algorithm to learn patterns.
Test Set (20-30%): Unseen data to evaluate final performance.
Validation Set: Used to tune hyperparameters.

k-Fold Cross Validation

Splits data into 'k' chunks. Trains on k-1, tests on the 1 remaining. Repeats k times. Averages the scores. More robust than a single split.

Python (Scikit-Learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
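The k-fold procedure described above comes down to index bookkeeping. This sketch uses a hypothetical `k_fold_indices` generator with contiguous, unshuffled folds; in practice rows are usually shuffled first, and scikit-learn's `KFold` or `cross_val_score` handles this for you.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_rows=10, k=5))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train on {len(train_idx)} rows, test on rows {test_idx}")
```

Each row lands in the test set exactly once, so every observation contributes to the averaged score.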

Classification: The Confusion Matrix

Compares Actual Truth vs. Model Prediction for binary classification.

		Predicted Condition
		Pred Positive (1)	Pred Negative (0)
Actual	Pos (1)	True Positive (TP)	False Negative (FN) - Type II Error
Actual	Neg (0)	False Positive (FP) - Type I Error	True Negative (TN)

Imbalanced Data Warning:

If 99% of emails are normal and 1% is spam, a model predicting "normal" every time has 99% accuracy but is completely useless. Do not use Accuracy for imbalanced data!

Fixes: SMOTE (Oversampling), Undersampling, Class Weights.

Evaluation Metrics

Accuracy: (TP+TN) / Total
Overall correctness (Bad for imbalanced)
Precision: TP / (TP + FP)
When it predicts Yes, how often is it right? (Minimize False Positives)
Recall (Sensitivity): TP / (TP + FN)
Out of all actual Yes, how many did we find? (Minimize False Negatives)
F1-Score: 2 * (Prec * Rec) / (Prec + Rec)
Harmonic mean of Precision and Recall.
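The four formulas can be applied to the spam scenario from the imbalanced data warning. A sketch: the function name is hypothetical, the counts are invented, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumed convention: an undefined ratio (0 / 0) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 1000 emails, 10 of them spam. Model A predicts "normal" every time.
always_normal = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Model B (hypothetical counts) catches 8 of 10 spam with 4 false alarms.
spam_filter = classification_metrics(tp=8, fp=4, fn=2, tn=986)

print(always_normal)
print(spam_filter)
```

Both models score about 99% accuracy, but only recall and F1 reveal that the first one never catches any spam.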

ROC Curve & AUC

ROC (Receiver Operating Characteristic): Plots True Positive Rate (Recall) vs False Positive Rate across different probability thresholds.

AUC (Area Under Curve): A single metric evaluating model performance independent of threshold.

AUC = 0.5: Useless (Random guessing).
AUC = 0.7 - 0.8: Acceptable.
AUC = 0.8 - 0.9: Excellent.
AUC = 1.0: Perfect (Suspicious!).
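AUC has a handy interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it by brute-force pair counting; `auc_from_scores` is a hypothetical helper, the scores are invented, and the approach is only practical for small data.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_from_scores(y_true, scores)
print(f"AUC = {auc:.2f}")
```

No threshold appears anywhere in the calculation, which is why AUC is described as threshold-independent.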

⚠️ MACHINE LEARNING TRAPS

TRAP Data Leakage: Scaling or imputing missing values BEFORE train/test split. Always fit scaler on Train only!
TRAP Overfitting: 99% train accuracy but 60% test accuracy. The model memorized the noise.
TRAP Thresholding: `.predict()` defaults to 0.5 threshold. You might need to change it via `.predict_proba()` to optimize Recall vs Precision.
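The thresholding trap can be illustrated without any library. In this sketch `probs` stands in for the positive-class column that `.predict_proba()` would return; the values, labels, and helper names are all hypothetical.

```python
def predict_at(probabilities, threshold):
    """Turn predicted probabilities into 0/1 labels at a chosen threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    """Precision and recall; an undefined ratio is reported as 0.0 (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical positive-class probabilities and true labels.
probs = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
y_true = [1, 1, 1, 0, 1, 0]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, predict_at(probs, threshold))
    print(f"threshold {threshold}: precision = {precision:.2f}, recall = {recall:.2f}")
```

Lowering the threshold catches more true positives (higher recall) at the cost of more false alarms (lower precision); which trade-off is right depends on the relative cost of each error.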