The Researcher's Complete Guide to Statistical Test Selection

1. From Confusion to Clarity: A Decision-Tree Approach to Hypothesis Testing


"Statistics is the grammar of science." - Karl Pearson

But what good is grammar if you don't know which words to use?

If you've ever stared at your data wondering whether to run a t-test or Mann-Whitney, ANOVA or Kruskal-Wallis, Pearson or Spearman, you're not alone. This guide exists because statistical test selection shouldn't require a PhD in mathematics. It should be a decision tree you can follow.

What makes this guide different:

  • Decision trees first: Navigate to your test, don't memorize everything
  • Engineer-friendly math: Formulas that make intuitive sense
  • "Don't use this when" sections: Knowing what NOT to do is half the battle
  • Real examples: From actual research scenarios
  • Visual explanations: Geometry over algebra wherever possible

Table of Contents

  1. Chapter 1: The Foundation - What You Must Know First
  2. Chapter 2: Understanding Distributions - The Shape of Your Data
  3. Chapter 3: The Master Decision Tree - Your Navigation System
  4. Chapter 4: Assumption Testing - The Gates You Must Pass
  5. Chapter 5: Comparing Groups - The Core Tests
  6. Chapter 6: Relationships Between Variables - Correlation Tests
  7. Chapter 7: Effect Size - The Forgotten Hero
  8. Chapter 8: Categorical Data Analysis
  9. Chapter 9: Advanced Techniques
  10. Chapter 10: Bayesian Alternatives - The Other Paradigm
  11. Chapter 11: Statistical Sins Researchers Commit
  12. Chapter 12: Quick Reference & Cheat Sheets

Chapter 1: The Foundation - What You Must Know First

Before we dive into test selection, let's establish the fundamental concepts that everything else builds upon. Think of this chapter as learning the alphabet before writing sentences.

2. 1.1 The Two Tribes: Frequentist vs. Bayesian

Statistical inference has two major philosophical camps. Understanding which camp your chosen test belongs to helps you interpret results correctly.

2.1. Frequentist Approach (The Dominant Paradigm)

Core belief: Probability represents long-run frequencies of events.

Imagine flipping a coin 10,000 times. The frequentist says: "The probability of heads is the proportion of heads I'd get if I repeated this infinitely."

Key characteristics:

  • Parameters (like population mean) are fixed but unknown
  • Data is random
  • We calculate: "What's the probability of seeing this data IF the null hypothesis is true?"
  • Results in p-values and confidence intervals

Analogy: A frequentist is like a factory quality inspector who tests 1000 light bulbs to estimate the defect rate. The true defect rate is fixed; they're trying to estimate it through repeated sampling.

2.2. Bayesian Approach (The Rising Alternative)

Core belief: Probability represents degrees of belief or certainty.

Key characteristics:

  • Parameters have probability distributions (they're uncertain)
  • We update prior beliefs with data to get posterior beliefs
  • We calculate: "What's the probability of this hypothesis GIVEN the data I observed?"
  • Results in posterior distributions and credible intervals

Analogy: A Bayesian is like a detective who starts with hunches (priors) and updates their beliefs as evidence comes in. Their certainty about who committed the crime changes with each clue.

[Figure: frequentist vs. Bayesian flowchart]

[Figure: frequentist vs. Bayesian comparison]


3. 1.2 The P-Value: Most Misunderstood Statistic in Science

3.1. What a P-Value Actually Is

Definition: The p-value is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true.

Let's break this down with an analogy:

The Courtroom Analogy:

  • Null hypothesis (H₀): The defendant is innocent
  • Alternative hypothesis (H₁): The defendant is guilty
  • Evidence: Your data
  • P-value: How surprising would this evidence be IF the defendant were truly innocent?

A small p-value (say, 0.001) means: "If the defendant were innocent, there's only a 0.1% chance we'd see evidence this damning. This evidence is very surprising under the innocence assumption."

3.2. The Mathematical Formula

$$p = P(T \geq t_{\text{observed}} \mid H_0 \text{ is true})$$

Where:

  • T is the test statistic (a number summarizing your data)
  • t_observed is the actual value you calculated from your sample
  • H₀ is the null hypothesis

Geometric Interpretation:

[Figure: geometric interpretation of the p-value]

3.3. What P-Values Are NOT

| Common Misconception | Reality |
| --- | --- |
| "P-value is the probability the null hypothesis is true" | NO. It's P(data ∣ H₀), not P(H₀ ∣ data) |
| "P = 0.05 means 5% chance results are due to chance" | NO. It means IF H₀ is true, 5% of experiments would show results this extreme |
| "P < 0.05 means my effect is important/large" | NO. Statistical significance ≠ practical significance |
| "P = 0.06 is 'trending toward significance'" | NO. Either your threshold is 0.05 or it isn't. There's no "trending." |
| "Non-significant p means no effect exists" | NO. Absence of evidence ≠ evidence of absence |

3.4. Computing P-Values: The General Process

  1. State your hypotheses (H₀ and H₁)
  2. Choose a test statistic (t, F, χ², Z, etc.)
  3. Calculate the test statistic from your data
  4. Find the probability of getting a value this extreme or more extreme under H₀
    • This requires knowing the distribution of your test statistic under H₀
    • Usually involves looking up tables or using software

Example Calculation:

Suppose you're testing if a new interface design reduces task completion time. Your null hypothesis is "no difference."

You collect data and calculate a t-statistic of 2.45 with 28 degrees of freedom.

The p-value = P(|T| ≥ 2.45 | H₀ is true)

Looking at a t-distribution with df=28, the area in both tails beyond ±2.45 is approximately 0.021.

So p ≈ 0.021, meaning: "If there truly were no difference, we'd see a result this extreme only about 2.1% of the time."
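The table lookup can be reproduced numerically. The following is a minimal sketch using only the Python standard library: it integrates the t-distribution density with Simpson's rule. The helper names `t_pdf` and `two_tailed_p` are illustrative, not part of any library; in practice a statistics package would do this for you.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t_obs: float, df: int, steps: int = 2000) -> float:
    """P(|T| >= |t_obs|) under H0, via Simpson's rule on the area between 0 and |t_obs|."""
    b = abs(t_obs)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3       # area between 0 and |t_obs|
    return 1.0 - 2.0 * central_half    # the remainder sits in the two tails

p_value = two_tailed_p(2.45, 28)
print(f"p = {p_value:.3f}")  # about 0.021
```

The distribution is symmetric, so the sketch integrates only the right half of the central area and doubles it.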

[Figure: p-value visualization]


4. 1.3 Hypothesis Testing: The Scientific Ritual

4.1. The Five-Step Framework

Every statistical test follows this ritual:

[Figure: the hypothesis testing ritual]

4.2. Type I and Type II Errors: The Two Ways to Be Wrong

| | H₀ is Actually True | H₀ is Actually False |
| --- | --- | --- |
| Reject H₀ | Type I Error (α): false positive, "crying wolf" | Correct! True positive (Power = 1 - β) |
| Fail to Reject H₀ | Correct! True negative | Type II Error (β): false negative, "missing the wolf" |

Memorable analogy:

  • Type I Error: Convicting an innocent person (false alarm)
  • Type II Error: Letting a guilty person go free (missed detection)

The Trade-off: Reducing Type I errors (being more conservative) increases Type II errors, and vice versa. You can't minimize both simultaneously with fixed sample size.

4.3. One-Tailed vs. Two-Tailed Tests

Two-tailed test: You're testing for ANY difference (could be higher OR lower)

  • H₁: μ₁ ≠ μ₂
  • P-value considers both tails of the distribution
  • Use when you have no directional prediction

One-tailed test: You're testing for a SPECIFIC direction

  • H₁: μ₁ > μ₂ (or μ₁ < μ₂)
  • P-value considers only one tail
  • Use when you have a strong directional hypothesis BEFORE collecting data

[Figure: one-tailed vs. two-tailed tests]

⚠️ Don't use this when: You choose a one-tailed test AFTER seeing which direction your data went. This is p-hacking!


5. 1.4 Statistical Power: Your Test's Ability to Detect Effects

5.1. What is Statistical Power?

Power = Probability of correctly rejecting H₀ when it's actually false

Power = 1 - β (where β is the Type II error rate)

Analogy: Power is like a metal detector's sensitivity. A high-power detector will find buried treasure (true effect) most of the time. A low-power detector will miss treasure that's actually there.

5.2. The Four Factors Affecting Power

[Figure: four factors affecting power]

  1. Sample size (n): Larger samples = More power
  2. Effect size (d): Larger effects = Easier to detect = More power
  3. Significance level (α): Higher α = More power (but more false positives)
  4. Variance (σ²): More variability = Harder to detect signal = Less power

5.3. How Much Power Do You Need?

Convention: Aim for at least 80% power (β = 0.20)

This means: If there truly is an effect, you have an 80% chance of detecting it.

  • A priori power analysis: Calculate required sample size BEFORE collecting data
  • Post-hoc power analysis: ⚠️ Generally discouraged (see Statistical Sins chapter)
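To make the a priori calculation concrete, here is a rough sketch for a two-sided comparison of two equal-sized groups. It assumes a normal approximation to the test statistic (an assumption of this example, not something the guide prescribes), so exact t-based software will report slightly larger sample sizes. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power to detect a standardized effect d (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    return STD_NORMAL.cdf(shift - z_crit)

def approx_n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)

n_needed = approx_n_per_group(d=0.5)               # medium effect, 80% power
power_at_64 = approx_power(d=0.5, n_per_group=64)
print(n_needed, round(power_at_64, 3))
```

All four factors from the list above appear in the formula: n and d enter through `shift`, α through `z_crit`, and the variance is folded into the standardized effect size d.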


6. 1.5 Degrees of Freedom: The Hidden Constraint

6.1. What Are Degrees of Freedom?

Degrees of freedom (df) = The number of independent values that can vary in your calculation

The Party Seating Analogy:

Imagine you're seating 5 guests at a round table with 5 chairs.

  • Guest 1: Can sit anywhere (5 choices)
  • Guest 2: Can sit in any of 4 remaining chairs
  • Guest 3: 3 remaining chairs
  • Guest 4: 2 remaining chairs
  • Guest 5: Only 1 chair left, so NO CHOICE!

You had freedom for 4 guests; the 5th was constrained. df = 5 - 1 = 4

6.2. Why Degrees of Freedom Matter

Different statistical distributions are actually families of distributions, with df as the parameter that determines the exact shape:

[Figure: why degrees of freedom matter]

6.3. Common Degrees of Freedom Formulas

| Test | Degrees of Freedom |
| --- | --- |
| One-sample t-test | n - 1 |
| Independent samples t-test | n₁ + n₂ - 2 |
| Paired samples t-test | n - 1 (n = number of pairs) |
| One-way ANOVA (between groups) | k - 1 (k = number of groups) |
| One-way ANOVA (within groups) | N - k (N = total observations) |
| Chi-square test | (rows - 1) × (columns - 1) |
| Correlation | n - 2 |

Chapter 2: Understanding Distributions - The Shape of Your Data

Before selecting a statistical test, you must understand your data's distribution. This chapter covers the probability distributions you'll encounter most frequently.

7. 2.1 Why Distributions Matter

Every statistical test makes assumptions about how your data is distributed. Use the wrong test for your distribution, and your results may be meaningless.

The Key Question: If I could collect infinite samples, what shape would the histogram take?

[Figure: why distributions matter]


8. 2.2 The Normal (Gaussian) Distribution

8.1. The Bell Curve: Queen of Distributions

Why it's special: Thanks to the Central Limit Theorem, the average of many independent random variables (with finite variance) tends toward a normal distribution, regardless of the shape of the original distribution. This is why it appears everywhere.

8.2. Mathematical Formula

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Breaking it down:

  • μ (mu): Mean, the center of the bell
  • σ (sigma): Standard deviation, the "width" of the bell
  • e: Euler's number (~2.718)
  • π: Pi (~3.14159)

Intuitive understanding: The formula says "the probability density decreases exponentially as you move away from the mean, with the rate of decrease controlled by the standard deviation."

8.3. The 68-95-99.7 Rule (Empirical Rule)

[Figure: the 68-95-99.7 rule]

  • 68% of data falls within ±1 standard deviation
  • 95% of data falls within ±2 standard deviations
  • 99.7% of data falls within ±3 standard deviations
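The rule can be checked directly against the standard normal CDF. This short sketch uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sd 1

# Share of the distribution within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within ±{k} SD: {share:.4f}")
# within ±1 SD: 0.6827
# within ±2 SD: 0.9545
# within ±3 SD: 0.9973
```

The rule's figures are rounded: ±2 SD covers about 95.45%, and exactly 95% corresponds to roughly ±1.96 SD.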

8.4. Real-World Examples

  • Human height (within a sex)
  • IQ scores
  • Measurement errors
  • Blood pressure readings

8.5. When Data is NOT Normal

Your data might not be normal if:

  • It's bounded (can't go below 0, like reaction times)
  • It has heavy tails (extreme values more common)
  • It's skewed (asymmetric)
  • It's multimodal (multiple peaks)

[Figure: distribution shapes]


9. 2.3 The t-Distribution

9.1. The Normal Distribution's Cautious Cousin

When you need it: When you're estimating a population mean but:

  • Sample size is small (typically n < 30)
  • Population standard deviation is unknown (you're estimating it from data)

9.2. How It Differs From Normal

[Figure: t-distribution vs. normal distribution]

The key insight: With small samples, our estimate of the standard deviation is uncertain. The t-distribution accounts for this extra uncertainty by putting more probability in the tails.

9.3. The Shape Parameter: Degrees of Freedom

As df increases, the t-distribution approaches the normal distribution.

| df | How close to normal? |
| --- | --- |
| 1 | Very different (Cauchy distribution) |
| 5 | Noticeably heavier tails |
| 30 | Nearly indistinguishable |
| ∞ | Exactly normal |

10. 2.4 The Chi-Square (χ²) Distribution

10.1. The Distribution of Squared Deviations

What it represents: The sum of k independent squared standard normal random variables, where k is the degrees of freedom.

$$\chi^2 = \sum_{i=1}^{k} Z_i^2$$

where each Zᵢ is a standard normal variable (mean=0, sd=1)

10.2. Shape Characteristics

[Figure: chi-square distribution]

Key properties:

  • Always ≥ 0 (it's a sum of squares!)
  • Right-skewed, especially for low df
  • Mean = df
  • Variance = 2 × df

10.3. Where You'll Encounter It

  • Chi-square test for independence
  • Goodness-of-fit tests
  • Variance tests
  • As part of F-distribution

11. 2.5 The F-Distribution

11.1. The Ratio of Two Chi-Squares

What it represents: The ratio of two independent chi-square variables, each divided by its df

$$F = \frac{\chi_1^2 / df_1}{\chi_2^2 / df_2}$$

Intuition: It compares two variances. How much bigger is one variance relative to another?

11.2. Shape Characteristics

[Figure: F-distribution]

11.3. Where You'll Encounter It

  • ANOVA (comparing multiple group means)
  • Regression (overall model significance)
  • Levene's test (comparing variances)

12. 2.6 Discrete Distributions

12.1. Bernoulli Distribution: Single Yes/No Trial

Parameters: p (probability of success)

Examples: Single coin flip, single survey response (yes/no)

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

12.2. Binomial Distribution: Multiple Yes/No Trials

Parameters: n (number of trials), p (probability of success per trial)

Formula: $$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Intuition: "Out of n independent trials, what's the probability of exactly k successes?"
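As a quick illustration, the formula translates almost symbol for symbol into Python. The coin-flip numbers below are an invented example, and `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 fair coin flips: 120 / 1024
prob_3_heads = binomial_pmf(3, 10, 0.5)

# Sanity check: probabilities over all possible k add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(round(prob_3_heads, 4), round(total, 4))  # 0.1172 1.0
```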

[Figure: binomial distribution]

12.3. Poisson Distribution: Counting Rare Events

Parameter: λ (lambda), the average rate of occurrence

Formula: $$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Use when: Counting events in a fixed interval (time, space, etc.)

  • Number of errors per page
  • Number of customers per hour
  • Number of mutations per genome
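A small sketch of the Poisson formula in code, using a made-up rate of 3 customers per hour (the rate and the name `poisson_pmf` are assumptions for illustration only):

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k): probability of exactly k events when the average rate is lam."""
    return lam**k * exp(-lam) / factorial(k)

rate = 3.0  # hypothetical average: 3 customers per hour

p_exactly_2 = poisson_pmf(2, rate)                         # 4.5 * e^-3
p_at_most_2 = sum(poisson_pmf(k, rate) for k in range(3))  # k = 0, 1, 2

print(round(p_exactly_2, 4), round(p_at_most_2, 4))  # 0.224 0.4232
```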

[Figure: Poisson distribution]


13. 2.7 Other Important Distributions

13.1. Exponential Distribution

Use for: Time between events (waiting times)

  • Time until next customer arrives
  • Time until system failure

Shape: Starts high at 0, decreases exponentially

13.2. Log-Normal Distribution

Use for: Data that's normal after taking logarithm

  • Income distributions
  • Reaction times
  • Stock prices

Shape: Right-skewed, bounded at 0

13.3. Uniform Distribution

Use for: All values equally likely in a range

  • Random number generators
  • Rolling a fair die

Shape: Flat rectangle

[Figure: distribution family]


14. 2.8 Distribution Selection Quick Reference

| Your Data Looks Like... | Likely Distribution | Common Tests |
| --- | --- | --- |
| Symmetric bell curve | Normal | t-tests, ANOVA, Pearson |
| Right-skewed, continuous | Log-normal or Exponential | Non-parametric tests, or transform |
| Counts (0, 1, 2, 3...) | Poisson or Binomial | Chi-square, Poisson regression |
| Yes/No outcomes | Bernoulli/Binomial | Chi-square, logistic regression |
| Rankings | (none) | Non-parametric tests |
| Bounded scores | Beta | Transform or non-parametric |

Chapter 3: The Master Decision Tree - Your Navigation System

This is the heart of the guide. Use these decision trees to navigate from "What do I want to know?" to "Which test should I use?"

15. 3.1 The Ultimate Decision Tree (ASCII Version)

[Figure: master decision tree]


16. 3.2 TREE A: Comparing Groups

[Figure: Tree A, comparing groups]


17. 3.3 TREE B: Measuring Relationships (Correlation)

[Figure: Tree B, measuring relationships]


18. 3.4 TREE C: Categorical Data Analysis

[Figure: Tree C, categorical data analysis]


19. 3.5 TREE D: Parametric vs Non-Parametric Decision

[Figure: Tree D, parametric vs. non-parametric decision]

[Figure: decision tree poster]

[Figure: parametric vs. non-parametric comparison]


Chapter 4: Assumption Testing - The Gates You Must Pass

Before running parametric tests, you must verify their assumptions are met. Think of these tests as gatekeepers that determine which path you can take.

20. 4.1 Testing for Normality

20.1. Shapiro-Wilk Test

Purpose: Tests whether a sample comes from a normally distributed population.

Hypotheses:

  • H₀: Data is normally distributed
  • H₁: Data is NOT normally distributed

When to use:

  • Sample size < 50 (most powerful for small samples)
  • ⚠️ With large samples (n > 300), may detect trivial deviations

Interpretation:

  • p > 0.05 → Fail to reject H₀ → No evidence against normality; treat data as approximately normal ✓
  • p < 0.05 → Reject H₀ → Data is NOT normal ✗

The Formula (conceptual):

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Where:

  • x₍ᵢ₎ = ordered sample values
  • aᵢ = constants generated from means and covariances of order statistics
  • W ranges from 0 to 1; values close to 1 suggest normality

⚠️ Don't use this when:

  • Sample size > 5000 (use visual inspection instead)
  • You have clear theoretical reasons to expect non-normality

20.2. Other Normality Tests

| Test | Best For | Notes |
| --- | --- | --- |
| Shapiro-Wilk | Small samples (n < 50) | Most powerful |
| Kolmogorov-Smirnov | Larger samples | Less powerful than Shapiro-Wilk |
| Anderson-Darling | Medium samples | More sensitive to tails |
| D'Agostino-Pearson | Large samples (n > 20) | Combines skewness and kurtosis into one omnibus test |

20.3. Visual Methods (Often Better Than Tests!)

Q-Q Plot (Quantile-Quantile Plot):

[Figure: Q-Q plot]

Histogram with Normal Curve Overlay:

  • Visual check for bell shape
  • Quick identification of skewness, multiple modes

[Figure: Q-Q plots]


21. 4.2 Testing for Homogeneity of Variance

21.1. Levene's Test

Purpose: Tests whether variances are equal across groups.

Hypotheses:

  • H₀: Variances are equal (homogeneity)
  • H₁: Variances are NOT equal (heterogeneity)

When to use:

  • Before independent samples t-test or ANOVA
  • More robust than Bartlett's test when normality is violated

The Formula:

$$W = \frac{(N-k)}{(k-1)} \times \frac{\sum_{i=1}^{k} n_i (Z_{i\cdot} - Z_{\cdot\cdot})^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - Z_{i\cdot})^2}$$

Where:

  • Zᵢⱼ = |Yᵢⱼ - Ȳᵢ| (absolute deviation from group mean)
  • Zᵢ. = mean of the Zᵢⱼ in group i; Z.. = mean of all Zᵢⱼ
  • nᵢ = size of group i
  • k = number of groups
  • N = total sample size

Interpretation:

  • p > 0.05 → No evidence of unequal variances ✓ → Proceed with standard tests
  • p < 0.05 → Variances are NOT equal ✗ → Use Welch's t-test or robust ANOVA
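The W statistic can be computed in a few lines. The sketch below follows the mean-centered formula above using only the standard library; `levene_w` and the toy data are illustrative. It stops at the statistic, since the p-value needs an F-distribution with (k - 1, N - k) degrees of freedom. For real analyses, `scipy.stats.levene` with `center='mean'` computes the same statistic and also returns the p-value.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W using absolute deviations from each group's mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Z_ij: absolute deviation of each observation from its own group mean
    z = []
    for g in groups:
        g_mean = mean(g)
        z.append([abs(y - g_mean) for y in g])

    z_group = [mean(zi) for zi in z]             # Z_i.
    z_all = mean([v for zi in z for v in zi])    # Z_..

    between = sum(len(zi) * (zg - z_all) ** 2 for zi, zg in zip(z, z_group))
    within = sum((v - zg) ** 2 for zi, zg in zip(z, z_group) for v in zi)
    return (n_total - k) / (k - 1) * between / within  # undefined if within == 0

# Toy data: the second group is twice as spread out as the first
w_stat = levene_w([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w_stat, 3))  # 0.8
```

In effect, W is a one-way ANOVA F-ratio computed on the absolute deviations instead of the raw scores.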

21.2. Comparison of Variance Tests

| Test | Assumes Normality? | Robustness |
| --- | --- | --- |
| Levene's (mean-based) | No | Robust |
| Brown-Forsythe (median-based Levene's) | No | Most robust, especially for skewed data |
| Bartlett's | Yes | Sensitive to non-normality |

⚠️ Don't use this when:

  • Groups are very unequal in size (ratio > 4:1)
  • Running paired/repeated measures designs

22. 4.3 Testing for Sphericity (Repeated Measures)

22.1. Mauchly's Test of Sphericity

Purpose: Tests whether the variances of differences between all pairs of conditions are equal. Required assumption for repeated measures ANOVA.

Hypotheses:

  • H₀: Sphericity assumption is met
  • H₁: Sphericity assumption is violated

When to use:

  • Before repeated measures ANOVA
  • When you have 3+ repeated conditions

Why sphericity matters:

[Figure: sphericity]

Interpretation:

  • p > 0.05 → Sphericity is met ✓ → Use standard repeated measures ANOVA
  • p < 0.05 → Sphericity is violated ✗ → Apply correction:
    • Greenhouse-Geisser: Conservative, use when ε < 0.75
    • Huynh-Feldt: Less conservative, use when ε ≥ 0.75

The Epsilon (ε) Correction:

  • ε = 1.0 means perfect sphericity
  • Lower ε means worse violation
  • Corrections multiply degrees of freedom by ε, making the test more conservative

23. 4.4 Assumption Checking Flowchart

[Figure: assumption checking flowchart]


24. 4.5 What To Do When Assumptions Are Violated

24.1. Decision Matrix

| Violation | Minor | Moderate | Severe |
| --- | --- | --- | --- |
| Non-normality | Proceed (t-test is robust) | Transform data | Use non-parametric |
| Unequal variances | Use Welch's correction | Use Welch's + bootstrap | Use non-parametric |
| Sphericity | Use Huynh-Feldt | Use Greenhouse-Geisser | Use MANOVA or mixed models |
| Outliers | Winsorize | Trim | Remove and report sensitivity analysis |

24.2. Data Transformations

| Original Distribution | Transformation | When to Use |
| --- | --- | --- |
| Right-skewed | Log(x) | Positive data, multiplicative effects |
| Right-skewed | √x | Count data |
| Right-skewed | 1/x | Extreme skew |
| Left-skewed | x² | Already positive data |
| Proportions | arcsin(√x) | Proportions bounded 0-1 |

⚠️ Caution with transformations:

  • Interpretation changes (log-transformed means are geometric means)
  • Must transform back for reporting
  • May not fully fix violations
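The first caution is easy to see numerically. In this small sketch (the data values are invented for illustration), averaging on the log scale and transforming back yields the geometric mean, not the arithmetic mean.

```python
from math import exp, log
from statistics import geometric_mean, mean

times = [1.0, 10.0, 100.0]  # hypothetical right-skewed measurements

log_mean = mean([log(x) for x in times])  # average on the log scale
back_transformed = exp(log_mean)          # back on the original scale

print(round(back_transformed, 6))       # 10.0, the geometric mean
print(round(geometric_mean(times), 6))  # 10.0
print(mean(times))                      # 37.0, the arithmetic mean
```

When reporting results from log-transformed data, it helps to say explicitly that the back-transformed value is a geometric mean.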

Chapter 5: Comparing Groups - The Core Tests

This chapter covers the tests you'll use most frequently: comparing means between groups.

25. Template for Each Test

Every test in this chapter follows this structure:

  1. What it tests
  2. When to use it
  3. Assumptions
  4. The formula (with intuitive explanation)
  5. Example
  6. When NOT to use it
  7. Reporting format

26. 5.1 The Z-Test

26.1. What It Tests

Tests whether a sample mean differs from a known population mean when the population standard deviation is known.

26.2. When to Use It

  • You know the population standard deviation (σ), which is rare in practice!
  • Large sample size (n ≥ 30)
  • Comparing sample to a known benchmark

26.3. Assumptions

  • ✓ Population standard deviation is known
  • ✓ Data is continuous
  • ✓ Random sampling
  • ✓ Normal distribution (or large sample)

26.4. The Formula

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

Breaking it down:

[Figure: Z-test]

  • Numerator: How different is your sample mean from the expected value?
  • Denominator: How much random variation would you expect in sample means of this size?
  • Z: How many standard errors away from expected is your result?

26.5. Example

Research question: Do users of our new interface complete tasks faster than the industry standard of 120 seconds?

  • Sample: n = 36 users
  • Sample mean: X̄ = 115 seconds
  • Population SD (known from industry data): σ = 18 seconds

$$Z = \frac{115 - 120}{18 / \sqrt{36}} = \frac{-5}{3} = -1.67$$

Interpretation: The sample mean is 1.67 standard errors below the population mean. Looking up Z = -1.67 in a standard normal table gives p ≈ 0.095 (two-tailed).

Since p > 0.05, we fail to reject H₀. Not enough evidence that our interface differs from the industry standard.
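The same calculation can be sketched in Python with the standard library's `NormalDist`. The function name `z_test` is illustrative. Because the code does not round Z before the lookup, the p-value comes out near 0.096, not the table's 0.095.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean: float, mu0: float, sigma: float, n: int):
    """One-sample Z-test from summary statistics; returns (Z, two-tailed p)."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_two_tailed = 2 * NormalDist().cdf(-abs(z))  # area in both tails
    return z, p_two_tailed

z_stat, p_two_tailed = z_test(sample_mean=115, mu0=120, sigma=18, n=36)
print(f"Z = {z_stat:.2f}, p = {p_two_tailed:.3f}")  # Z = -1.67, p = 0.096
```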

26.6. ⚠️ Don't Use This When

  • Population standard deviation is unknown (use t-test instead)
  • Sample size is small AND population SD unknown
  • Data is clearly non-normal with small n

26.7. Reporting Format

"A one-sample Z-test indicated that task completion time (M = 115s) did not significantly differ from the industry standard of 120s, Z = -1.67, p = .095."


27. 5.2 One-Sample t-Test

27.1. What It Tests

Tests whether a sample mean differs from a known or hypothesized value when population SD is unknown.

27.2. When to Use It

  • Comparing a sample to a known/expected value
  • Population SD is unknown (estimated from sample)
  • Single group, single measurement

27.3. Assumptions

  • ✓ Data is continuous (interval/ratio)
  • ✓ Random sampling
  • ✓ Approximately normal (or n ≥ 30)
  • ✓ No extreme outliers

27.4. The Formula

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

Where:

  • X̄ = sample mean
  • μ₀ = hypothesized population mean
  • s = sample standard deviation
  • n = sample size

The only difference from Z-test: We use sample SD (s) instead of population SD (σ)

$$df = n - 1$$

27.5. Why Degrees of Freedom Matter

When we estimate SD from the sample, we introduce uncertainty. The t-distribution accounts for this by having heavier tails than the normal distribution. With more data (higher df), we're more certain about our SD estimate, and the t-distribution approaches normal.

27.6. Example

Research question: Does our VR experience increase sense of presence compared to the neutral score of 4.0 on a 7-point scale?

  • Sample: n = 25 participants
  • Sample mean: X̄ = 5.2
  • Sample SD: s = 1.5

$$t = \frac{5.2 - 4.0}{1.5 / \sqrt{25}} = \frac{1.2}{0.3} = 4.0$$

df = 25 - 1 = 24

Looking up t = 4.0 with df = 24: p < 0.001

Interpretation: The presence score is significantly above the neutral point.
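For readers who prefer code to hand arithmetic, here is a minimal sketch that reproduces the test statistic, degrees of freedom, and Cohen's d from the summary statistics. The helper name is illustrative. With raw data, `scipy.stats.ttest_1samp` is one readily available routine that also returns the p-value.

```python
from math import sqrt

def one_sample_t(sample_mean: float, mu0: float, sd: float, n: int):
    """One-sample t statistic, degrees of freedom, and Cohen's d from summary stats."""
    standard_error = sd / sqrt(n)
    t = (sample_mean - mu0) / standard_error
    df = n - 1
    cohens_d = (sample_mean - mu0) / sd  # mean difference in SD units
    return t, df, cohens_d

t_stat, df, d = one_sample_t(sample_mean=5.2, mu0=4.0, sd=1.5, n=25)
print(f"t({df}) = {t_stat:.2f}, d = {d:.2f}")  # t(24) = 4.00, d = 0.80
```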

27.7. ⚠️ Don't Use This When

  • Data is clearly non-normal (use Wilcoxon signed-rank)
  • You have severe outliers
  • Data is ordinal (ranks)
  • You're comparing two groups (use two-sample t-test)

27.8. Reporting Format

"A one-sample t-test revealed that presence scores (M = 5.2, SD = 1.5) were significantly above the neutral point of 4.0, t(24) = 4.00, p < .001, d = 0.80."


28. 5.3 Independent Samples t-Test

28.1. What It Tests

Tests whether the means of two independent groups differ.

28.2. When to Use It

  • Two separate groups (e.g., treatment vs. control)
  • Groups are independent (different people)
  • Comparing one continuous outcome

28.3. Assumptions

  • ✓ Independence between groups
  • ✓ Data is continuous
  • ✓ Normal distribution in each group (or n ≥ 30 per group)
  • ✓ Homogeneity of variance (equal variances)

28.4. The Formula

Standard (equal variances assumed):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Where pooled variance is:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Welch's t-test (unequal variances):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

28.5. Geometric Interpretation

[Figure: independent samples t-test]

28.6. Example

Research question: Do users learn faster with gamified tutorials vs. traditional tutorials?

| | Gamified | Traditional |
| --- | --- | --- |
| n | 20 | 20 |
| Mean (minutes) | 15.3 | 22.1 |
| SD | 4.2 | 5.1 |

Levene's test: p = 0.31 (no evidence of unequal variances ✓)

Pooled variance:

$$s_p^2 = \frac{(19)(4.2)^2 + (19)(5.1)^2}{38} = \frac{335.16 + 494.19}{38} = 21.83$$

$$t = \frac{15.3 - 22.1}{\sqrt{21.83 \times (0.05 + 0.05)}} = \frac{-6.8}{\sqrt{2.183}} = \frac{-6.8}{1.477} = -4.60$$

df = 38, p < .001

Interpretation: Gamified tutorials led to significantly faster learning.
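The sketch below recomputes this example from its summary statistics and, for comparison, also runs Welch's version. The function names are illustrative. The Welch-Satterthwaite degrees-of-freedom formula is the standard one, though it is not spelled out in this guide. With raw data, `scipy.stats.ttest_ind` (with `equal_var=False` for Welch) is a common choice.

```python
from math import sqrt

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's t-test (equal variances assumed) from summary statistics."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    t = (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2, sp2

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t-test (unequal variances) with Welch-Satterthwaite df."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

t_pooled, df_pooled, sp2 = pooled_t(15.3, 4.2, 20, 22.1, 5.1, 20)
t_welch, df_welch = welch_t(15.3, 4.2, 20, 22.1, 5.1, 20)
cohens_d = (15.3 - 22.1) / sqrt(sp2)

print(f"pooled: t({df_pooled}) = {t_pooled:.2f}, d = {cohens_d:.2f}")
print(f"Welch:  t({df_welch:.1f}) = {t_welch:.2f}")
```

With equal group sizes the two t statistics coincide; only the degrees of freedom differ (about 36.7 for Welch versus 38).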

28.7. ⚠️ Don't Use This When

  • The same people are in both groups (use paired t-test)
  • You have more than 2 groups (use ANOVA)
  • Severe violation of normality with small samples (use Mann-Whitney U)
  • Variances are very unequal AND unequal n (use Welch's t-test)

28.8. Reporting Format

"An independent samples t-test showed that gamified tutorials (M = 15.3, SD = 4.2) led to significantly faster completion times than traditional tutorials (M = 22.1, SD = 5.1), t(38) = -4.60, p < .001, d = 1.46."


29. 5.4 Paired Samples t-Test

29.1. What It Tests

Tests whether the mean difference between paired observations is zero.

29.2. When to Use It

  • Same participants measured twice (pre/post)
  • Matched pairs (twins, matched controls)
  • Repeated measurements on same items

29.3. Assumptions

  • ✓ Pairs are independent of other pairs
  • ✓ Differences are continuous
  • ✓ Differences are approximately normal
  • ✓ No extreme outliers in differences

29.4. The Formula

$$t = \frac{\bar{D}}{s_D / \sqrt{n}}$$

Where:

  • D = differences for each pair (X₁ - X₂)
  • D̄ = mean of differences
  • sD = standard deviation of differences
  • n = number of pairs

Key insight: We're essentially running a one-sample t-test on the differences!

29.5. Why It's More Powerful Than Independent t-Test

[Figure: paired samples t-test]

By using each person as their own control, we remove between-person variability.

29.6. Example

Research question: Does a 10-minute mindfulness exercise reduce stress?

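| Participant | Before | After | Difference (D) |
| --- | --- | --- | --- |
| 1 | 7 | 5 | -2 |
| 2 | 8 | 6 | -2 |
| 3 | 6 | 5 | -1 |
| 4 | 9 | 6 | -3 |
| 5 | 5 | 4 | -1 |
| 6 | 8 | 7 | -1 |
| 7 | 7 | 5 | -2 |
| 8 | 6 | 4 | -2 |

  • D̄ = -1.75
  • sD = 0.71
  • n = 8

$$t = \frac{-1.75}{0.71 / \sqrt{8}} = \frac{-1.75}{0.251} = -6.97$$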

df = 7, p < .001

Interpretation: Stress significantly decreased after mindfulness exercise.
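Here is a short sketch of the same calculation from the raw scores, using only the standard library. Carrying full precision gives t ≈ -7.00; the -6.97 above comes from rounding sD to 0.71 before dividing. With raw data, `scipy.stats.ttest_rel` would also return the p-value.

```python
from math import sqrt
from statistics import mean, stdev

before = [7, 8, 6, 9, 5, 8, 7, 6]
after = [5, 6, 5, 6, 4, 7, 5, 4]

# Difference for each participant (after minus before, as in the table)
diffs = [a - b for b, a in zip(before, after)]

d_bar = mean(diffs)   # mean difference
s_d = stdev(diffs)    # sample SD of the differences (n - 1 denominator)
n = len(diffs)

t_stat = d_bar / (s_d / sqrt(n))  # a one-sample t-test on the differences
df = n - 1

print(f"D-bar = {d_bar}, sD = {s_d:.3f}, t({df}) = {t_stat:.2f}")
# D-bar = -1.75, sD = 0.707, t(7) = -7.00
```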

29.7. ⚠️ Don't Use This When

  • Observations are not truly paired
  • Differences are severely non-normal (use Wilcoxon signed-rank)
  • You have more than 2 time points (use repeated measures ANOVA)

29.8. Reporting Format

"A paired samples t-test indicated that stress scores significantly decreased from pre-intervention (M = 7.0, SD = 1.3) to post-intervention (M = 5.25, SD = 1.0), t(7) = -6.97, p < .001, d = 1.53."


30. 5.5 Mann-Whitney U Test (Wilcoxon Rank-Sum)

30.1. What It Tests

Tests whether two independent groups come from the same distribution. Non-parametric alternative to independent t-test.

30.2. When to Use It

  • Two independent groups
  • Data is ordinal OR continuous but non-normal
  • Outliers present
  • Small samples with uncertain distribution

30.3. Assumptions

  • ✓ Independence between groups
  • ✓ At least ordinal data
  • ✓ Similar shape distributions (if comparing medians)

30.4. The Formula (Conceptual)

  1. Rank all observations (ignoring group membership)
  2. Sum the ranks for each group (R₁, R₂)
  3. Calculate U:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

Intuition: U counts how many times a value from one group beats a value from the other group.

30.5. Example

Research question: Do novice users give different usability ratings than experts?

| Novices (Group A) | Experts (Group B) |
| --- | --- |
| 3 | 7 |
| 4 | 6 |
| 2 | 8 |
| 5 | 9 |

Combined and ranked:

  • 2(A)→1, 3(A)→2, 4(A)→3, 5(A)→4, 6(B)→5, 7(B)→6, 8(B)→7, 9(B)→8

R₁ (Novices) = 1+2+3+4 = 10

R₂ (Experts) = 5+6+7+8 = 26

U₁ = 4×4 + 4×5/2 - 10 = 16 + 10 - 10 = 16

U₂ = 4×4 + 4×5/2 - 26 = 16 + 10 - 26 = 0

U = min(U₁, U₂) = 0

Interpretation: With U = 0, there's no overlap at all! Every expert rating exceeded every novice rating. This is the most extreme result possible with four observations per group (exact two-sided p = 2/70 ≈ .029).
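The ranking steps and the exact p-value for a sample this small can be sketched in plain Python. The helper names are illustrative, and the exact test below enumerates every way of relabeling the pooled data, which is practical only for small samples. `scipy.stats.mannwhitneyu` is the usual tool for real analyses.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """U = min(U1, U2), using the rank-sum formula for U1."""
    n1, n2 = len(a), len(b)
    r1 = sum(average_ranks(list(a) + list(b))[:n1])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1  # U1 + U2 always equals n1 * n2
    return min(u1, u2)

def exact_two_sided_p(a, b):
    """Share of all group relabelings whose U is at least as extreme as observed."""
    pooled = list(a) + list(b)
    observed = mann_whitney_u(a, b)
    hits = total = 0
    for chosen in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        hits += mann_whitney_u(group_a, group_b) <= observed
    return hits / total

novices, experts = [3, 4, 2, 5], [7, 6, 8, 9]
u_stat = mann_whitney_u(novices, experts)
p_exact = exact_two_sided_p(novices, experts)
print(u_stat, round(p_exact, 3))  # 0.0 0.029
```

Only 2 of the 70 possible relabelings give U = 0, so .029 is the smallest two-sided p-value this design can produce.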

30.6. ⚠️ Don't Use This When

  • Data meets t-test assumptions (t-test has more power)
  • You need to compare means specifically
  • You have paired/repeated data (use Wilcoxon signed-rank)

30.7. Reporting Format

"A Mann-Whitney U test revealed that expert usability ratings (Mdn = 7.5) were significantly higher than novice ratings (Mdn = 3.5), U = 0, p = .029, r = .87."


31. 5.6 Wilcoxon Signed-Rank Test

31.1. What It Tests

Tests whether the median difference between paired observations is zero. Non-parametric alternative to paired t-test.

31.2. When to Use It

  • Paired/repeated measures data
  • Differences are non-normal
  • Ordinal data
  • Small sample with uncertain distribution

31.3. Assumptions

  • ✓ Pairs are independent
  • ✓ At least ordinal data
  • ✓ Symmetric distribution of differences (for median interpretation)

31.4. The Formula (Conceptual)

  1. Calculate differences for each pair
  2. Rank absolute differences (ignore zeros; tied values share the average rank)
  3. Assign signs based on direction of difference
  4. Sum positive ranks (W⁺) and negative ranks (W⁻)
  5. Test statistic: W = min(W⁺, W⁻)

31.5. Example

Research question: Does a UI redesign improve satisfaction scores?

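| User | Before | After | Difference | \|Diff\| | Rank | Signed Rank |
|------|--------|-------|------------|----------|------|-------------|
| 1 | 3 | 5 | +2 | 2 | 3 | +3 |
| 2 | 4 | 4 | 0 | - | - | - |
| 3 | 2 | 5 | +3 | 3 | 5 | +5 |
| 4 | 5 | 4 | -1 | 1 | 1 | -1 |
| 5 | 3 | 5 | +2 | 2 | 3 | +3 |
| 6 | 2 | 4 | +2 | 2 | 3 | +3 |

The three tied absolute differences of 2 occupy rank positions 2, 3, and 4, so each receives the average rank 3.

W⁺ = 3 + 5 + 3 + 3 = 14

W⁻ = 1

W = min(W⁺, W⁻) = 1

Interpretation: The small W⁻ suggests most differences were positive (improvement). With only five non-zero differences, however, the exact two-sided p-value is .125, so a sample this small cannot establish significance at the .05 level.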
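The five steps can be traced in a short standard-library sketch. The helper name `average_ranks` is hypothetical, and the exact p-value is computed by enumerating every possible sign pattern, an approach that only scales to very small samples.

```python
from itertools import product

before = [3, 4, 2, 5, 3, 2]  # satisfaction before the redesign
after = [5, 4, 5, 4, 5, 4]   # satisfaction after the redesign

# Steps 1-2: differences, dropping zeros
diffs = [a - b for a, b in zip(after, before) if a != b]


def average_ranks(values):
    """Rank values from smallest to largest; ties share the average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        positions = [i + 1 for i, x in enumerate(ordered) if x == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]


ranks = average_ranks([abs(d) for d in diffs])

# Steps 3-5: signed rank sums and the test statistic
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
w = min(w_plus, w_minus)

# Exact two-sided p: under H0 every sign pattern is equally likely
patterns = list(product([1, -1], repeat=len(ranks)))
extreme = 0
for signs in patterns:
    wp = sum(r for r, s in zip(ranks, signs) if s > 0)
    wm = sum(ranks) - wp
    if min(wp, wm) <= w:
        extreme += 1
p_exact = extreme / len(patterns)

print(ranks, w_plus, w_minus, w, p_exact)
# [3.0, 5.0, 1.0, 3.0, 3.0] 14.0 1.0 1.0 0.125
```

A quick sanity check: W⁺ + W⁻ must equal n(n+1)/2 for the n non-zero pairs, here 5×6/2 = 15.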

31.6. ⚠️ Don't Use This When

  • Data is normal (paired t-test has more power)
  • You have independent groups (use Mann-Whitney U)
  • You need to compare means specifically

31.7. Reporting Format

"A Wilcoxon signed-rank test indicated that the increase in satisfaction scores after the redesign was not statistically significant, W = 1, p = .125, r = .72."


32. 5.7 One-Way ANOVA

32.1. What It Tests

Tests whether means differ across three or more independent groups.

32.2. When to Use It

  • 3+ independent groups
  • One continuous outcome variable
  • One categorical independent variable (factor)

32.3. Assumptions

  • ✓ Independence of observations
  • ✓ Normal distribution within each group
  • ✓ Homogeneity of variance
  • ✓ No extreme outliers

32.4. The Formula (F-ratio)

$$F = \frac{MS_{between}}{MS_{within}} = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

Where:

$$MS_{between} = \frac{SS_{between}}{df_{between}} = \frac{\sum n_j(\bar{X}_j - \bar{X})^2}{k-1}$$

$$MS_{within} = \frac{SS_{within}}{df_{within}} = \frac{\sum \sum (X_{ij} - \bar{X}_j)^2}{N-k}$$

32.5. Geometric Interpretation

[Figure: 5.7_onewayAnova]

32.6. The ANOVA Table

| Source | SS | df | MS | F |
|--------|----|----|----|---|
| Between | Σnⱼ(X̄ⱼ - X̄)² | k-1 | SS_B/df_B | MS_B/MS_W |
| Within | ΣΣ(Xᵢⱼ - X̄ⱼ)² | N-k | SS_W/df_W | |
| Total | ΣΣ(Xᵢⱼ - X̄)² | N-1 | | |

32.7. Example

Research question: Does font type affect reading speed?

| Sans-serif | Serif | Decorative |
|------------|-------|------------|
| 120 | 115 | 95 |
| 125 | 118 | 90 |
| 118 | 112 | 88 |
| 122 | 120 | 92 |
| 119 | 117 | 85 |

Group means: 120.8, 116.4, 90.0

Grand mean: 109.07

SS_between = 5×[(120.8-109.07)² + (116.4-109.07)² + (90.0-109.07)²] = 2,774.93

SS_within = Σ(deviations within groups)² = 30.8 + 37.2 + 58.0 = 126.0

F = (2774.93/2) / (126.0/12) = 1387.47 / 10.50 = 132.1

df_between = 2, df_within = 12, p < .001

Interpretation: Font type significantly affects reading speed.
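The sums of squares and the F-ratio for the font data can be reproduced with a few lines of plain Python. This is a sketch under the assumption of a simple one-way between-subjects design; the function name `one_way_anova` is invented for this example, and no p-value is computed because that needs the F distribution.

```python
groups = {
    "sans_serif": [120, 125, 118, 122, 119],
    "serif": [115, 118, 112, 120, 117],
    "decorative": [95, 90, 88, 92, 85],
}


def one_way_anova(groups):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        # Between: each group mean's distance from the grand mean, weighted by n
        ss_between += len(values) * (mean - grand_mean) ** 2
        # Within: each score's distance from its own group mean
        ss_within += sum((x - mean) ** 2 for x in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f


ss_b, ss_w, df_b, df_w, f = one_way_anova(groups)
print(round(ss_b, 2), round(ss_w, 2), df_b, df_w, round(f, 1))
# 2774.93 126.0 2 12 132.1
```

Recomputing from the raw scores like this is a useful check whenever hand-calculated sums of squares are reported.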

32.8. Post-Hoc Tests

ANOVA tells you if groups differ, but not which groups. Post-hoc tests determine specific pairwise differences:

| Test | When to Use |
|------|-------------|
| Tukey HSD | Equal n, all pairwise comparisons |
| Bonferroni | Conservative, few planned comparisons |
| Scheffé | Most conservative, complex comparisons |
| Games-Howell | Unequal variances |

32.9. ⚠️ Don't Use This When

  • Only 2 groups (use t-test)
  • Same participants across conditions (use repeated measures ANOVA)
  • Non-normal data with small n (use Kruskal-Wallis)
  • Unequal variances with unequal n (use Welch's ANOVA)

32.10. Reporting Format

"A one-way ANOVA revealed a significant effect of font type on reading speed, F(2, 12) = 132.1, p < .001, η² = .96. Post-hoc Tukey tests showed decorative fonts (M = 90.0) were significantly slower than both sans-serif (M = 120.8, p < .001) and serif (M = 116.4, p < .001)."

[Figure: 5.1_ANOVA_Visualization]


33. 5.8 Two-Way ANOVA

33.1. What It Tests

Tests the effects of two independent variables (and their interaction) on a continuous outcome.

33.2. When to Use It

  • Two categorical independent variables (factors)
  • One continuous dependent variable
  • Interested in interaction effects

33.3. The Three Questions It Answers

  1. Main effect of Factor A: Does A affect the outcome, averaging across levels of B?
  2. Main effect of Factor B: Does B affect the outcome, averaging across levels of A?
  3. Interaction A×B: Does the effect of A depend on the level of B?

33.4. Understanding Interactions

[Figure: 5.8_twowayAnova]

33.5. Example

Research question: How do device type (phone/tablet) and user age (young/old) affect task completion time?

| Mean time (sec) | Phone | Tablet |
|-----------------|-------|--------|
| Young | 45 | 40 |
| Old | 75 | 50 |

Results:

  • Main effect of Device: F(1,36) = 25.3, p < .001 (tablet faster)
  • Main effect of Age: F(1,36) = 42.1, p < .001 (young faster)
  • Interaction Device × Age: F(1,36) = 8.7, p = .006

Interpretation: The interaction indicates that the device difference is larger for older users (25 sec) than younger users (5 sec).

33.6. ⚠️ Don't Use This When

  • One or both factors have only 1 level
  • Design is unbalanced and you need Type III sums of squares
  • Factors are within-subjects (use repeated measures ANOVA)

33.7. Reporting Format

"A 2×2 between-subjects ANOVA revealed significant main effects of device, F(1, 36) = 25.3, p < .001, η²p = .41, and age, F(1, 36) = 42.1, p < .001, η²p = .54. These were qualified by a significant interaction, F(1, 36) = 8.7, p = .006, η²p = .19, indicating that older adults benefited more from tablets than younger adults."


34. 5.9 Repeated Measures ANOVA

34.1. What It Tests

Tests whether means differ across three or more related groups (same participants measured multiple times).

34.2. When to Use It

  • Same participants measured at 3+ time points/conditions
  • One continuous outcome
  • Want to control for individual differences

34.3. Assumptions

  • ✓ No significant outliers
  • ✓ Normality of differences
  • ✓ Sphericity (equal variances of differences between conditions)

34.4. The Sphericity Problem

Unlike between-subjects ANOVA, repeated measures ANOVA requires that the variances of differences between all pairs of conditions are equal.

```
Conditions: A, B, C

Sphericity requires:
Var(A-B) ≈ Var(B-C) ≈ Var(A-C)

If violated:
• F-ratio becomes too liberal (inflated Type I error)
• Use Greenhouse-Geisser or Huynh-Feldt correction
```

34.5. Example

Research question: Does performance change across three training sessions?

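| Participant | Session 1 | Session 2 | Session 3 |
|-------------|-----------|-----------|-----------|
| 1 | 50 | 65 | 80 |
| 2 | 45 | 60 | 75 |
| 3 | 55 | 70 | 82 |
| 4 | 48 | 62 | 78 |
| 5 | 52 | 68 | 85 |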

Mauchly's test: p = .12 (sphericity OK)

F(2, 8) = 794.1, p < .001

Interpretation: Performance significantly improved across sessions.
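To see where the F-ratio comes from, and what "variances of differences" means for sphericity, the training data can be decomposed by hand. The sketch below uses only the standard library; `rm_anova` and `sample_variance` are illustrative names, and it does not implement Mauchly's test or any sphericity correction.

```python
scores = [  # rows = participants, columns = sessions 1-3
    [50, 65, 80],
    [45, 60, 75],
    [55, 70, 82],
    [48, 62, 78],
    [52, 68, 85],
]


def rm_anova(scores):
    """One-way repeated measures ANOVA: return (SS_cond, SS_subj, SS_error, F)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    # Individual differences are removed from the error term
    ss_error = ss_total - ss_cond - ss_subj
    f = (ss_cond / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))
    return ss_cond, ss_subj, ss_error, f


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


ss_cond, ss_subj, ss_error, f = rm_anova(scores)

# Sphericity concerns the variances of the pairwise difference scores
k = len(scores[0])
diff_vars = {
    (i + 1, j + 1): sample_variance([row[i] - row[j] for row in scores])
    for i in range(k)
    for j in range(i + 1, k)
}

print(round(ss_cond, 1), round(ss_error, 2), round(f, 1))  # 2250.0 11.33 794.1
print(diff_vars)  # variances of (S1-S2), (S1-S3), (S2-S3) differences
```

The F-ratio is very large here because every participant improves by almost exactly the same amount, leaving very little error variance once individual differences are removed.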

34.6. ⚠️ Don't Use This When

  • Only 2 conditions (use paired t-test)
  • Different participants in each condition (use one-way ANOVA)
  • Sphericity severely violated and corrections don't help (use MANOVA or mixed models)

34.7. Reporting Format

"A one-way repeated measures ANOVA showed a significant effect of training session on performance, F(2, 8) = 794.1, p < .001, η²p = .99. Mauchly's test indicated that the sphericity assumption was not violated, χ²(2) = 4.27, p = .12."


35. 5.10 Kruskal-Wallis Test

35.1. What It Tests

Tests whether distributions differ across three or more independent groups. Non-parametric alternative to one-way ANOVA.

35.2. When to Use It

  • 3+ independent groups
  • Data is ordinal or continuous but non-normal
  • Small samples with unknown distribution

35.3. Assumptions

  • ✓ Independent groups
  • ✓ At least ordinal data
  • ✓ Similar distribution shapes (for median comparison)

35.4. The Formula

$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)$$

Where:

  • N = total sample size
  • Rⱼ = sum of ranks in group j
  • nⱼ = sample size of group j

Intuition: Like ANOVA but using ranks instead of raw scores.

35.5. Example

Research question: Do three design prototypes differ in perceived usability?

| Prototype A | Prototype B | Prototype C |
|-------------|-------------|-------------|
| 5 | 7 | 3 |
| 4 | 8 | 2 |
| 6 | 6 | 4 |
| 3 | 7 | 3 |

H = 8.41 (corrected for ties), df = 2, p = .015

Interpretation: Usability perceptions significantly differ across prototypes.
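A minimal sketch of the H calculation for the prototype data follows. It applies the formula from the section above, then the standard tie correction (dividing by 1 - Σ(t³ - t)/(N³ - N), where t is the size of each group of tied values), which is an addition not spelled out in the text. The function name `kruskal_wallis` is hypothetical, and the p-value shortcut `exp(-H/2)` is valid only because df = 2 here.

```python
import math

groups = [
    [5, 4, 6, 3],  # Prototype A
    [7, 8, 6, 7],  # Prototype B
    [3, 2, 4, 3],  # Prototype C
]


def kruskal_wallis(groups):
    """Return (H uncorrected, H corrected for ties)."""
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)
    rank_of = {}
    tie_term = 0
    for v in set(pooled):
        positions = [i + 1 for i, x in enumerate(pooled) if x == v]
        rank_of[v] = sum(positions) / len(positions)  # average rank for ties
        t = len(positions)
        tie_term += t ** 3 - t
    weighted = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h, h / correction


h_raw, h_corrected = kruskal_wallis(groups)
p_value = math.exp(-h_corrected / 2)  # chi-square upper tail, df = 2 only

print(round(h_raw, 2), round(h_corrected, 2), round(p_value, 3))
# 8.2 8.41 0.015
```

The rank sums for this data are 24, 41.5, and 12.5, which add up to N(N+1)/2 = 78 as they should.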

35.6. Post-Hoc Tests

  • Dunn's test with Bonferroni correction
  • Mann-Whitney U tests with Bonferroni correction

35.7. ⚠️ Don't Use This When

  • Data meets ANOVA assumptions (ANOVA has more power)
  • Same participants across conditions (use Friedman test)
  • You need to compare means specifically

35.8. Reporting Format

"A Kruskal-Wallis test indicated significant differences in usability ratings across prototypes, H(2) = 8.41, p = .015. Post-hoc Dunn's tests showed Prototype B (Mdn = 7) was rated higher than Prototype C (Mdn = 3), p = .011."


36. 5.11 Friedman Test

36.1. What It Tests

Tests whether distributions differ across three or more related groups. Non-parametric alternative to repeated measures ANOVA.

36.2. When to Use It

  • Same participants measured at 3+ time points/conditions
  • Data is ordinal or continuous but non-normal
  • Sphericity assumption cannot be met

36.3. Assumptions

  • ✓ Same participants in all conditions
  • ✓ At least ordinal data
  • ✓ Random sample of participants

36.4. The Formula

$$\chi^2_F = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1)$$

Where:

  • n = number of participants
  • k = number of conditions
  • Rⱼ = sum of ranks for condition j

Process:

  1. Rank scores within each participant (1 to k)
  2. Sum ranks for each condition
  3. Calculate test statistic

36.5. Example

Research question: Do users rate three app interfaces differently?

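| User | Interface A | Interface B | Interface C |
|------|-------------|-------------|-------------|
| 1 | 3 (rank 1) | 7 (rank 3) | 5 (rank 2) |
| 2 | 4 (rank 1) | 6 (rank 2) | 8 (rank 3) |
| 3 | 2 (rank 1) | 5 (rank 2) | 7 (rank 3) |
| 4 | 5 (rank 2) | 8 (rank 3) | 4 (rank 1) |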

Rank sums: R_A = 5, R_B = 10, R_C = 9

χ²_F = 12/(4×3×4) × (5² + 10² + 9²) - 3×4×4 = 51.5 - 48 = 3.5, df = 2, p = .174

Interpretation: No significant difference in interface ratings.
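The three-step process can be checked with a small script. This is a sketch using only the standard library; `friedman` is an illustrative name, no tie correction is applied (the example has no within-participant ties), and the p-value shortcut `exp(-x/2)` holds only for df = 2.

```python
import math

ratings = [  # rows = users, columns = interfaces A, B, C
    [3, 7, 5],
    [4, 6, 8],
    [2, 5, 7],
    [5, 8, 4],
]


def friedman(ratings):
    """Return (rank sums per condition, Friedman chi-square statistic)."""
    n, k = len(ratings), len(ratings[0])
    rank_sums = [0.0] * k
    for row in ratings:
        ordered = sorted(row)
        for j, value in enumerate(row):
            # Rank within this participant; ties would share the average rank
            positions = [i + 1 for i, x in enumerate(ordered) if x == value]
            rank_sums[j] += sum(positions) / len(positions)
    stat = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    return rank_sums, stat


rank_sums, chi2_f = friedman(ratings)
p_value = math.exp(-chi2_f / 2)  # chi-square upper tail, df = 2 only

print(rank_sums, chi2_f, round(p_value, 3))  # [5.0, 10.0, 9.0] 3.5 0.174
```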

36.6. Post-Hoc Tests

  • Wilcoxon signed-rank tests with Bonferroni correction
  • Conover test

36.7. ⚠️ Don't Use This When

  • Data meets repeated measures ANOVA assumptions
  • Only 2 conditions (use Wilcoxon signed-rank)
  • Different participants in each group (use Kruskal-Wallis)

36.8. Reporting Format

"A Friedman test showed no significant difference in interface ratings, χ²(2) = 3.5, p = .174."


Chapter 6: Relationships Between Variables - Correlation Tests

Correlation quantifies the strength and direction of a relationship between two variables. This chapter covers when and how to use different correlation methods.

37. 6.1 Understanding Correlation Fundamentals

37.1. What Correlation Measures

Correlation coefficient (r): A standardized measure of the linear relationship between two variables.

Range: -1 to +1

[Figure: 6.1_correlation]

37.2. Correlation Strength Guidelines

| \|r\| | Interpretation |
|-------|----------------|
| 0.00 - 0.19 | Negligible |
| 0.20 - 0.39 | Weak |
| 0.40 - 0.59 | Moderate |
| 0.60 - 0.79 | Strong |
| 0.80 - 1.00 | Very strong |

⚠️ Important: These are guidelines, not rules. Context matters! A correlation of 0.30 might be impressive in psychology but weak in physics.

37.3. Correlation ≠ Causation

Classic examples:

  • Ice cream sales and drowning deaths are correlated (both caused by hot weather)
  • Shoe size and reading ability in children are correlated (both caused by age)

[Figure: 6.1_correlation_causation]


38. 6.2 Pearson Product-Moment Correlation

38.1. What It Tests

Measures the strength of linear relationship between two continuous variables.

38.2. When to Use It

  • Both variables are continuous (interval/ratio)
  • Relationship appears linear
  • Both variables approximately normally distributed
  • No extreme outliers

38.3. Assumptions

  • ✓ Continuous data
  • ✓ Linear relationship
  • ✓ Bivariate normality (the two variables are jointly normal; in practice, check that each is approximately normal)
  • ✓ Homoscedasticity (equal variance across range)
  • ✓ No extreme outliers

38.4. The Formula

$$r = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2 \sum(Y_i - \bar{Y})^2}}$$

Simplified form:

$$r = \frac{\text{Covariance}(X, Y)}{SD_X \times SD_Y}$$

Intuition: Correlation = Covariance standardized by the product of standard deviations
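The formula translates almost line for line into code. The sketch below is a plain-Python illustration with made-up data (`hours` and `scores` are hypothetical values, not from a real study), and `pearson_r` is just a name chosen for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed directly from the deviation formula."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of cross-products of deviations
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


hours = [1, 2, 3, 4, 5]         # hypothetical study hours
scores = [52, 55, 61, 64, 70]   # hypothetical exam scores

r = pearson_r(hours, scores)
print(round(r, 3), round(r ** 2, 3))  # r, and r squared as variance explained
```

Because both the numerator and the denominator are built from deviations, rescaling either variable (for example, minutes instead of hours) leaves r unchanged.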

38.5. Coefficient of Determination (r²)

r² tells you the proportion of variance in Y explained by X.

If r = 0.60, then r² = 0.36 = 36% of variance explained.

[Figure: 6.2_pearsonrsquare]

38.6. ⚠️ Don't Use This When

  • Relationship is non-linear (check scatter plot!)
  • Data is ordinal (use Spearman)
  • Severe outliers present (use Spearman)
  • Non-normal distributions (use Spearman)

38.7. Reporting Format

"A Pearson correlation revealed a significant strong positive relationship between VR exposure time and motion sickness, r(4) = .97, p < .001, r² = .94."


39. 6.3 Spearman Rank Correlation

39.1. What It Tests

Measures the strength of monotonic relationship between two variables using ranks.

39.2. When to Use It

  • Data is ordinal
  • Continuous data but non-normal
  • Relationship is monotonic but not linear
  • Outliers present

39.3. Assumptions

  • ✓ At least ordinal data
  • ✓ Monotonic relationship (consistently increasing or decreasing)
  • ✓ Independent observations

39.4. The Formula

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

Where:

  • dᵢ = difference between ranks of Xᵢ and Yᵢ
  • n = number of pairs

This shortcut formula is exact only when there are no tied ranks; with ties, compute Pearson's r on the ranks instead.

Process:

  1. Rank X values (1 to n)
  2. Rank Y values (1 to n)
  3. Calculate difference between ranks for each pair
  4. Apply formula
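The four steps above can be sketched as follows. This minimal version assumes there are no tied values (the condition under which the shortcut formula is exact); the function names and both small datasets are invented for illustration.

```python
def ranks(values):
    """Rank from 1 (smallest) to n. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho via the d-squared shortcut formula (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Monotonic but clearly non-linear: y = x cubed
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho_monotonic = spearman_rho(x, y)

# Two hypothetical rankings that mostly, but not fully, agree
usability = [1, 2, 3, 4, 5]
satisfaction = [2, 1, 4, 3, 5]
rho_partial = spearman_rho(usability, satisfaction)

print(rho_monotonic, rho_partial)  # 1.0 0.8
```

The first result shows why Spearman suits curved-but-monotonic data: the ranks agree perfectly even though the raw relationship is not a straight line.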

39.5. When Spearman > Pearson

[Figure: 6.2_spearman]

39.6. ⚠️ Don't Use This When

  • You specifically need to measure linear relationship
  • You want to quantify amount of variance explained (r² interpretation changes)
  • Data is truly continuous and normal (Pearson is more powerful)

39.7. Reporting Format

"A Spearman correlation indicated a strong positive, but not statistically significant, relationship between usability rankings and satisfaction rankings, ρ(3) = .80, p = .104."


40. 6.4 Other Correlation Types

40.1. Point-Biserial Correlation (rpb)

Use when: One variable is continuous, one is binary (0/1)

Example: Correlation between gender (M/F) and test scores

Note: Mathematically equivalent to Pearson correlation; gives same result.

40.2. Phi Coefficient (φ)

Use when: Both variables are binary (0/1)

Example: Correlation between purchase (yes/no) and email opened (yes/no)

$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

Where a, b, c, d are frequencies in a 2×2 contingency table.

40.3. Kendall's Tau (τ)

Use when:

  • Ordinal data with many tied ranks
  • Small sample sizes
  • You want a more conservative estimate than Spearman's ρ

40.4. Partial Correlation

Use when: You want to measure correlation between X and Y while controlling for Z.

Example: Correlation between ice cream sales and drowning, controlling for temperature.

$$r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ} \times r_{YZ}}{\sqrt{(1-r_{XZ}^2)(1-r_{YZ}^2)}}$$
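A short sketch shows how the formula behaves on the ice cream example. The three input correlations are invented for illustration (they are not real data), chosen so that temperature fully accounts for the raw association.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Hypothetical values: X = ice cream sales, Y = drownings, Z = temperature
r_sales_drowning = 0.60
r_sales_temp = 0.80
r_drowning_temp = 0.75

adjusted = partial_corr(r_sales_drowning, r_sales_temp, r_drowning_temp)
print(round(adjusted, 3))  # essentially zero once temperature is controlled
```

With these inputs the raw correlation of .60 equals the product .80 × .75, so nothing is left once temperature is held constant, which is the pattern expected from a purely confounded relationship.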


41. 6.5 The Correlation Matrix

41.1. What It Is

A table showing correlations between all pairs of variables.

41.2. Structure

[Figure: 6.5_correlation_matrix]

41.3. When to Use It

  • Exploratory data analysis
  • Before regression (check for multicollinearity)
  • Before factor analysis or PCA
  • Identifying clusters of related variables

[Figure: 6.1_Correlation_Matrix]


42. 6.6 Correlation Quick Reference

| Scenario | Test | Assumptions |
|----------|------|-------------|
| Both continuous, linear relationship, normal | Pearson | Bivariate normality, linearity, homoscedasticity |
| One or both ordinal | Spearman | Monotonic relationship |
| Continuous but non-normal or outliers | Spearman | Monotonic relationship |
| Non-linear but monotonic | Spearman | Monotonic relationship |
| Many tied ranks, small sample | Kendall's τ | Ordinal data |
| One continuous, one binary | Point-biserial | Same as Pearson |
| Both binary | Phi (φ) | 2×2 table |
| Control for third variable | Partial correlation | Varies |

Chapter 7: Effect Size - The Forgotten Hero

P-values tell you IF an effect exists. Effect sizes tell you HOW BIG it is. This chapter explains why effect sizes matter more than p-values for practical decisions.

43. 7.1 Why Effect Size Matters

43.1. The Problem with P-Values Alone

[Figure: 7.1_effectSize]

43.2. Effect Size = Practical Significance

| | Statistical Significance | Practical Significance |
|---|---|---|
| Question | "Is there ANY effect?" | "Is the effect BIG ENOUGH to matter?" |
| Measure | p-value | Effect size |
| Influenced by | Sample size | Only the actual effect |
| Needed for | Publication conventions | Real-world decisions |

44. 7.2 Cohen's d - The Gold Standard for Mean Comparisons

44.1. What It Is

Cohen's d expresses the difference between means in standard deviation units.

44.2. The Formula

For independent groups:

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}$$

Where:

$$s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$

For paired samples:

$$d = \frac{\bar{D}}{s_D}$$
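Both versions of the formula can be sketched with the standard library's `statistics` module. The function names and the small datasets below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
from statistics import mean, stdev


def cohens_d_independent(group1, group2):
    """Cohen's d for two independent groups, using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2) / (
        n1 + n2 - 2
    )
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


def cohens_d_paired(before, after):
    """Cohen's d for paired samples: mean difference over SD of differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)


# Hypothetical independent groups: means 12 and 10, both with SD = 2
d_independent = cohens_d_independent([10, 12, 14], [8, 10, 12])

# Hypothetical paired scores
d_paired = cohens_d_paired(before=[1, 2, 3, 4], after=[2, 4, 4, 6])

print(d_independent, round(d_paired, 2))  # 1.0 2.6
```

In the first case the means differ by 2 and the pooled SD is 2, so the groups sit exactly one standard deviation apart.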

44.3. Geometric Interpretation

[Figure: 7.2_cohensD]

44.4. Cohen's Conventions

| d | Interpretation | Overlap |
|---|----------------|---------|
| 0.2 | Small | 85% overlap between distributions |
| 0.5 | Medium | 67% overlap |
| 0.8 | Large | 53% overlap |

⚠️ Warning: These are guidelines, not rules! A d = 0.3 might be huge in one context and trivial in another.


45. 7.3 Eta-Squared (η²) and Partial Eta-Squared (η²p) - For ANOVA

45.1. What They Measure

Proportion of variance in the dependent variable explained by the independent variable.

45.2. Eta-Squared (η²)

$$\eta^2 = \frac{SS_{effect}}{SS_{total}}$$

Interpretation: "What proportion of total variance is due to this effect?"

45.3. Partial Eta-Squared (η²p)

$$\eta^2_p = \frac{SS_{effect}}{SS_{effect} + SS_{error}}$$

Interpretation: "What proportion of variance is due to this effect, excluding variance explained by other factors?"

Use η²p when: Multiple factors in your design (two-way ANOVA, etc.)

45.4. Conventions for η² and η²p

| Value | Interpretation |
|-------|----------------|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |

46. 7.4 Omega-Squared (ω²) - Less Biased Alternative

46.1. The Problem with η²

η² is biased: it overestimates the population effect size, especially with small samples.

46.2. The Solution: Omega-Squared

$$\omega^2 = \frac{SS_{between} - (k-1)MS_{within}}{SS_{total} + MS_{within}}$$

Interpretation: Same as η² but less biased. Generally gives smaller (more accurate) estimates.
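To see how much the two measures differ in practice, here is a small sketch that computes both for the font-type reading-speed data from the one-way ANOVA example in Section 5.7. It uses only the standard library, and the variable names are illustrative.

```python
groups = [
    [120, 125, 118, 122, 119],  # sans-serif
    [115, 118, 112, 120, 117],  # serif
    [95, 90, 88, 92, 85],       # decorative
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
means = [sum(g) / len(g) for g in groups]

ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
ss_total = ss_between + ss_within

k = len(groups)
ms_within = ss_within / (len(all_values) - k)

eta_squared = ss_between / ss_total
omega_squared = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

print(round(eta_squared, 3), round(omega_squared, 3))  # 0.957 0.946
```

The gap is small here because the effect is very large; with weaker effects and small samples the downward adjustment from ω² becomes proportionally more important.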


47. 7.5 R² - For Correlation and Regression

47.1. Coefficient of Determination

$$R^2 = r^2$$

Interpretation: Proportion of variance in Y explained by X (the equality with r² holds when there is a single predictor).

47.2. Example

If r = 0.60 between study hours and exam scores:

  • R² = 0.36
  • "Study hours explain 36% of the variance in exam scores"

47.3. Adjusted R² (For Multiple Regression)

$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$$

Where n = sample size and k = number of predictors

Why adjusted?: Regular R² never decreases when you add predictors, even useless ones. Adjusted R² penalizes predictors that do not improve the fit.


48. 7.6 r (Effect Size for Non-Parametric Tests)

48.1. Converting to r

For non-parametric tests, convert test statistics to r:

From Mann-Whitney U:

$$r = \frac{Z}{\sqrt{N}}$$

From Wilcoxon:

$$r = \frac{Z}{\sqrt{N}}$$

In both cases, Z is the standardized test statistic and N is the total number of observations.

48.2. Conventions for r as Effect Size

| r | Interpretation |
|---|----------------|
| 0.10 | Small |
| 0.30 | Medium |
| 0.50 | Large |

49. 7.7 Odds Ratio (OR) and Relative Risk (RR)

49.1. For Categorical Outcomes

Odds Ratio:

$$OR = \frac{a/b}{c/d} = \frac{ad}{bc}$$

Using a 2×2 table:

| | Outcome + | Outcome - |
|---|---|---|
| Group 1 | a | b |
| Group 2 | c | d |

Interpretation:

  • OR = 1: No difference
  • OR > 1: Outcome more likely in Group 1
  • OR < 1: Outcome less likely in Group 1
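The section title also mentions relative risk, which compares the outcome probabilities rather than the odds: RR = [a/(a+b)] / [c/(c+d)]. The sketch below computes both measures from the same 2×2 layout; the counts are invented for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds of the outcome in Group 1 divided by the odds in Group 2."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Probability of the outcome in Group 1 divided by that in Group 2."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: Group 1 has 30 of 100 positive, Group 2 has 15 of 100
a, b, c, d = 30, 70, 15, 85

or_value = odds_ratio(a, b, c, d)
rr_value = relative_risk(a, b, c, d)
print(round(or_value, 2), round(rr_value, 2))  # 2.43 2.0
```

The two measures agree closely only when the outcome is rare; as the outcome becomes more common, the odds ratio moves further from 1 than the relative risk does, so it should not be read as "times more likely."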

50. 7.8 Effect Size Quick Reference

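| Test Type | Effect Size Measure | Small | Medium | Large |
|-----------|---------------------|-------|--------|-------|
| t-tests | Cohen's d | 0.2 | 0.5 | 0.8 |
| ANOVA | η² or η²p | 0.01 | 0.06 | 0.14 |
| ANOVA (less biased) | ω² | 0.01 | 0.06 | 0.14 |
| Correlation | r or R² | 0.1 / 0.01 | 0.3 / 0.09 | 0.5 / 0.25 |
| Chi-square | Cramér's V | 0.1 | 0.3 | 0.5 |
| Non-parametric | r | 0.1 | 0.3 | 0.5 |

[Figure: 7.1_CohensD]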


51. 7.9 Coefficient of Variation (CV)

51.1. What It Is

The coefficient of variation expresses standard deviation as a percentage of the mean.

$$CV = \frac{s}{\bar{X}} \times 100\%$$

51.2. When to Use It

  • Comparing variability between datasets with different units
  • Comparing variability between datasets with different means
  • Assessing measurement precision
  • Quality control

51.3. Example

Comparing consistency of two instruments:

  • Instrument A: M = 100, SD = 5 → CV = 5%
  • Instrument B: M = 1000, SD = 30 → CV = 3%

Instrument B is more consistent relative to its mean, despite having the larger SD!

51.4. Limitations

  • Meaningless when mean is close to zero
  • Requires ratio-scale data (true zero point)

Chapter 8: Categorical Data Analysis

When your data is counts or categories rather than measurements, you need different tests.

52. 8.1 Chi-Square Goodness-of-Fit Test

52.1. What It Tests

Tests whether observed frequencies match expected frequencies for a single categorical variable.

52.2. When to Use It

  • One categorical variable
  • Testing if distribution matches expectation
  • Expected frequencies ≥ 5 in each category

52.3. The Formula

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where:

  • Oᵢ = observed frequency in category i
  • Eᵢ = expected frequency in category i

Intuition: Sum up standardized squared deviations from expectation.

52.4. Degrees of Freedom

df = k - 1 (where k = number of categories)

52.5. Example

Research question: Do users prefer certain color schemes equally?

| Color | Observed | Expected (equal preference) |
|-------|----------|-----------------------------|
| Blue | 45 | 30 |
| Green | 28 | 30 |
| Red | 17 | 30 |
| Total | 90 | 90 |

$$\chi^2 = \frac{(45-30)^2}{30} + \frac{(28-30)^2}{30} + \frac{(17-30)^2}{30} = 13.27$$

df = 3 - 1 = 2, p < .01

Interpretation: Color preferences are not equal; blue is preferred.
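The color-scheme calculation can be verified in a few lines. This sketch assumes equal expected frequencies, as in the example; the p-value shortcut `exp(-χ²/2)` is exact for df = 2 only and would need a proper chi-square distribution function for other degrees of freedom.

```python
import math

observed = {"Blue": 45, "Green": 28, "Red": 17}

n = sum(observed.values())
expected = n / len(observed)  # equal preference: 90 / 3 = 30 per color

chi2 = sum((count - expected) ** 2 / expected for count in observed.values())
df = len(observed) - 1
p_value = math.exp(-chi2 / 2)  # chi-square upper tail, valid for df = 2 only

print(round(chi2, 2), df, round(p_value, 3))  # 13.27 2 0.001
```

The per-category contributions (7.50 for blue, 0.13 for green, 5.63 for red) show that blue's surplus and red's deficit drive the result.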

52.6. ⚠️ Don't Use This When

  • Any expected frequency < 5 (use exact tests or combine categories)
  • Data is continuous (use different test)
  • Two categorical variables (use chi-square test of independence)

52.7. Reporting Format

"A chi-square goodness-of-fit test indicated that color preferences were not equally distributed, χ²(2) = 13.27, p = .001."


53. 8.2 Chi-Square Test of Independence

53.1. What It Tests

Tests whether two categorical variables are related (associated).

53.2. When to Use It

  • Two categorical variables
  • Independent observations
  • Expected frequencies ≥ 5 in most cells

53.3. The Formula

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where expected frequency:

$$E_{ij} = \frac{(\text{Row total}_i)(\text{Column total}_j)}{N}$$

53.4. Degrees of Freedom

df = (rows - 1) × (columns - 1)

53.5. Example

Research question: Is there an association between device type and task completion?

| | Completed | Not Completed | Total |
|---|---|---|---|
| Phone | 40 | 20 | 60 |
| Tablet | 55 | 5 | 60 |
| Total | 95 | 25 | 120 |

χ² = 11.37, df = 1, p < .001

Interpretation: Device type and task completion are significantly associated.

53.6. Effect Size: Cramér's V

$$V = \sqrt{\frac{\chi^2}{N \times (k-1)}}$$

Where k = min(rows, columns)

V ranges from 0 (no association) to 1 (perfect association).
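The device-by-completion example can be reproduced end to end, including the expected frequencies and Cramér's V. This is a standard-library sketch without a continuity correction; `chi_square_independence` is an illustrative name, and the p-value shortcut via `math.erfc` applies to df = 1 only.

```python
import math

table = [
    [40, 20],  # Phone: completed, not completed
    [55, 5],   # Tablet: completed, not completed
]


def chi_square_independence(table):
    """Return (chi-square, Cramer's V) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v


chi2, cramers_v = chi_square_independence(table)
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, df = 1 only

print(round(chi2, 2), round(cramers_v, 2), p_value < 0.001)  # 11.37 0.31 True
```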

53.7. ⚠️ Don't Use This When

  • Expected values < 5 in more than 20% of cells (use Fisher's exact)
  • 2×2 table with small N (use Fisher's exact)
  • Paired/repeated observations (use McNemar)

53.8. Reporting Format

"A chi-square test of independence showed a significant association between device type and task completion, χ²(1) = 11.37, p < .001, V = .31."


54. 8.3 Fisher's Exact Test

54.1. What It Tests

Same as chi-square test of independence, but exact (no approximation).

54.2. When to Use It

  • 2×2 contingency table
  • Small sample (N < 20)
  • Expected frequencies < 5

54.3. How It Works

Calculates the exact probability of obtaining the observed (or more extreme) table, given fixed marginal totals.
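For a 2×2 table, "exact probability given fixed marginal totals" means the hypergeometric distribution. The sketch below implements one common two-sided definition (summing the probabilities of all tables no more likely than the observed one); other definitions of the two-sided p-value exist, the function name is hypothetical, and the example counts are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is as likely as, or less likely than, the observed one
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical small study: 3 of 4 succeed in one group, 1 of 4 in the other
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```

With margins of 4 and 4 there are only five possible tables, which makes it easy to check the sum by hand: (16 + 1 + 1 + 16) / 70.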

54.4. ⚠️ Don't Use This When

  • Large samples (chi-square is fine and faster)
  • Tables larger than 2×2 with large N

54.5. Reporting Format

"Fisher's exact test revealed a significant association between tutorial use and configuration success, p = .047, OR = 11.67."


55. 8.4 McNemar's Test

55.1. What It Tests

Tests for changes in proportions when the same subjects are measured twice (paired categorical data).

55.2. When to Use It

  • Same subjects measured before and after
  • Binary outcome (yes/no)
  • Interested in whether proportions changed

55.3. The Setup

| | After: Yes | After: No |
|---|---|---|
| Before: Yes | a | b |
| Before: No | c | d |

55.4. The Formula

$$\chi^2 = \frac{(b-c)^2}{b+c}$$

Intuition: Only cells b and c matter (people who changed). Tests if changers were equally likely to go in either direction.

55.5. Example

Research question: Did training change whether employees follow safety protocols?

| | After: Follow | After: Don't Follow |
|---|---|---|
| Before: Follow | 45 | 5 |
| Before: Don't Follow | 20 | 10 |

$$\chi^2 = \frac{(5-20)^2}{5+20} = 9.0$$

df = 1, p = .003

Interpretation: The training significantly changed behavior.
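Because only the discordant cells matter, the whole test fits in a few lines. The sketch below computes the chi-square version shown above and, for comparison, an exact two-sided binomial version, which is an addition to the example rather than part of it. The `math.erfc` shortcut for the p-value is valid for df = 1 only.

```python
from math import comb, erfc, sqrt

b, c = 5, 20  # discordant pairs: stopped following, started following

# Chi-square version (no continuity correction)
chi2 = (b - c) ** 2 / (b + c)
p_chi2 = erfc(sqrt(chi2 / 2))  # chi-square upper tail, df = 1 only

# Exact version: under H0 each changer is equally likely to go either way
n = b + c
tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
p_exact = min(1.0, 2 * tail)

print(chi2, round(p_chi2, 4), round(p_exact, 4))  # 9.0 0.0027 0.0041
```

Both versions lead to the same conclusion here; they diverge more when the number of discordant pairs is small.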

55.6. ⚠️ Don't Use This When

  • Observations are independent (use chi-square)
  • More than 2 categories (use Cochran's Q or Bowker's test)
  • b + c < 25 (use exact McNemar test)

55.7. Reporting Format

"McNemar's test indicated a significant change in protocol adherence after training, χ²(1) = 9.0, p = .003."


56. 8.5 Cochran's Q Test

56.1. What It Tests

Extension of McNemar's test for three or more related groups with binary outcomes.

56.2. When to Use It

  • Same subjects measured at 3+ time points
  • Binary outcome
  • Repeated measures design

56.3. Reporting Format

"Cochran's Q test showed significant differences in preference across design iterations, Q(2) = 12.5, p = .002."


Chapter 9: Advanced Techniques

This chapter briefly covers techniques for complex data structures.

57. 9.1 Principal Component Analysis (PCA)

57.1. What It Does

Reduces many correlated variables to fewer uncorrelated components.

57.2. When to Use It

  • Too many variables to analyze individually
  • Variables are correlated (redundant)
  • Want to identify underlying dimensions
  • Data preprocessing for other analyses

57.3. How It Works (Conceptual)

[Figure: 9.1_pca]

57.4. Key Outputs

  1. Eigenvalues: Variance explained by each component
  2. Eigenvectors: Weights defining each component
  3. Loadings: Correlations between variables and components
  4. Scores: Values of components for each observation

57.5. Deciding How Many Components to Keep

| Method | Rule |
|--------|------|
| Kaiser criterion | Keep components with eigenvalue > 1 |
| Scree plot | Keep components before the "elbow" |
| Variance explained | Keep enough to explain 70-80% |

57.6. Assumptions

  • Continuous or at least interval data
  • Linear relationships between variables
  • Adequate sample size (n > 5 per variable, ideally)
  • Sufficient correlations (KMO > 0.6, Bartlett's test significant)

57.7. ⚠️ Don't Use This When

  • Variables are clearly independent (nothing to reduce)
  • You need interpretable factors (consider Factor Analysis)
  • Data is categorical (use Multiple Correspondence Analysis)

58. 9.2 Factor Analysis

58.1. What It Does

Identifies latent (hidden) factors that explain correlations among variables.

58.2. Difference from PCA

| PCA | Factor Analysis |
|-----|-----------------|
| Data reduction technique | Theory-driven model |
| Components are exact mathematical constructs | Factors are assumed to cause observed correlations |
| Explains all variance | Explains only shared variance (communality) |
| No assumptions about underlying structure | Assumes latent factors exist |

58.3. When to Use It

  • Developing or validating a questionnaire
  • Testing theoretical constructs
  • Identifying dimensions of a concept

59. 9.3 Mixed-Effects Models (Brief Overview)

59.1. What They Do

Handle data with both fixed effects (your experimental variables) and random effects (grouping structures like participants or items).

59.2. When to Use Them

  • Repeated measures with missing data
  • Nested data (students within schools)
  • Crossed random effects (participants × items)
  • Unbalanced designs

59.3. Why They're Increasingly Preferred

  • More flexible than traditional ANOVA
  • Handle missing data gracefully
  • Don't require sphericity
  • Model individual differences

60. 9.4 Bootstrap Methods (Brief Overview)

60.1. What They Do

Estimate sampling distributions by repeatedly resampling from your data.

60.2. When to Use Them

  • Assumptions of parametric tests are violated
  • Calculating confidence intervals for complex statistics
  • Small samples where distribution is unclear

60.3. Basic Concept

[Figure: 9.4_bootstrap]
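The resampling idea can be sketched with the standard library alone. This is a minimal percentile bootstrap, one of several bootstrap interval methods; the function name, the default of 5,000 resamples, the fixed seed, and the sample values are all assumptions made for illustration.

```python
import random
from statistics import median


def bootstrap_ci(data, statistic, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        # Resample with replacement, same size as the original data
        statistic([rng.choice(data) for _ in data])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


sample = [12, 15, 9, 22, 17, 14, 30, 11, 16, 13]  # hypothetical task times
low, high = bootstrap_ci(sample, median)
print(median(sample), low, high)
```

The same function works for statistics that have no simple formula for a standard error, such as a median or a trimmed mean, which is the main practical appeal of the method.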


Chapter 10: Bayesian Alternatives - The Other Paradigm

This chapter provides a brief introduction to Bayesian approaches as alternatives to the frequentist tests covered earlier.

61. 10.1 Key Differences Revisited

61.1. Frequentist vs. Bayesian

| Aspect | Frequentist | Bayesian |
|--------|-------------|----------|
| Probability means | Long-run frequency | Degree of belief |
| Parameters are | Fixed but unknown | Random variables with distributions |
| Data is | Random | Fixed (observed) |
| Question asked | P(data \| hypothesis) | P(hypothesis \| data) |
| Result | P-value, confidence interval | Posterior distribution, credible interval |
| Prior information | Not used formally | Explicitly included |

62. 10.2 Bayesian Equivalents of Common Tests

| Frequentist Test | Bayesian Equivalent |
|------------------|---------------------|
| One-sample t-test | Bayesian one-sample test |
| Independent t-test | Bayesian independent samples test |
| Paired t-test | Bayesian paired samples test |
| ANOVA | Bayesian ANOVA |
| Correlation | Bayesian correlation |
| Chi-square | Bayesian contingency table analysis |

62.1. Bayes Factor (BF)

The Bayesian equivalent of hypothesis testing uses the Bayes Factor:

$$BF_{10} = \frac{P(\text{Data} \mid H_1)}{P(\text{Data} \mid H_0)}$$

Interpretation:

| BF₁₀ | Evidence for H₁ |
|------|-----------------|
| 1-3 | Anecdotal |
| 3-10 | Moderate |
| 10-30 | Strong |
| 30-100 | Very strong |
| >100 | Extreme |

Key advantage: A Bayes factor can quantify evidence FOR the null hypothesis (BF₁₀ < 1), rather than merely failing to reject it!

62.2. Credible Intervals vs. Confidence Intervals

95% Confidence Interval (frequentist): "If we repeated this study infinitely, 95% of calculated intervals would contain the true parameter."

95% Credible Interval (Bayesian): "Given the data and prior, there's a 95% probability the parameter lies in this interval."

The Bayesian interpretation is what most people think confidence intervals mean!

63. 10.3 When to Consider Bayesian Methods

Advantages:

  • Intuitive probability statements
  • Can provide evidence for null hypothesis
  • Incorporates prior knowledge
  • Avoids some common p-value pitfalls (for example, misreading p as the probability that the hypothesis is true)

Challenges:

  • Requires specifying priors (can be controversial)
  • Computationally intensive
  • Less familiar to reviewers
  • Software learning curve

64. 10.4 Reporting Bayesian Results

"A Bayesian independent samples t-test revealed strong evidence for a difference between groups, BF₁₀ = 24.3. The posterior distribution of the effect size had a median of 0.72 (95% credible interval [0.35, 1.12])."


Chapter 11: Statistical Sins Researchers Commit

This chapter covers common mistakes that lead to incorrect conclusions. Learn from others' errors!

65. 11.1 P-Hacking: The Garden of Forking Paths

65.1. What It Is

Manipulating analysis (consciously or unconsciously) to achieve p < 0.05.

65.2. Common P-Hacking Techniques

| Sin | Description |
|-----|-------------|
| Selective reporting | Only reporting significant results |
| Optional stopping | Stopping data collection when p < 0.05 |
| Outcome switching | Changing primary outcome to one that's significant |
| Covariate fishing | Adding/removing covariates until significant |
| Subgroup searching | Testing many subgroups, reporting significant ones |
| Multiple testing | Running many tests, reporting without correction |

65.3. Real Example

A well-known 2011 paper by Simmons, Nelson, and Simonsohn ("False-Positive Psychology") showed that, by exploiting various "researcher degrees of freedom," you can find "significant" effects for almost anything, including "evidence" that listening to a particular song makes people younger!

65.4. Prevention

  • Pre-register your analysis plan
  • Report all analyses conducted
  • Use correction for multiple comparisons
  • Distinguish exploratory from confirmatory
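The cost of uncorrected multiple testing, and the simplest correction for it, can be shown with a short calculation. The sketch assumes independent tests for the family-wise error formula; the function names and the three p-values are made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Twenty uncorrected tests at alpha = .05
fwer = familywise_error_rate(0.05, 20)
print(round(fwer, 2))  # 0.64

# Hypothetical raw p-values from three comparisons
raw = [0.01, 0.04, 0.20]
adjusted = bonferroni_adjust(raw)
print([round(p, 2) for p in adjusted])  # [0.03, 0.12, 0.6]
```

With twenty tests there is roughly a two-in-three chance of at least one spurious "finding," and after correction only the first of the three example comparisons remains below .05.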

[Figure: 11.1_garden_of_p]

[Figure: 11.1_GardenOfForkingPaths]


66. 11.2 HARKing: Hypothesizing After Results are Known

66.1. What It Is

Presenting exploratory findings as if they were predicted all along.

66.2. Why It's a Problem

  • Inflates false positive rate
  • Makes findings seem more credible than warranted
  • Prevents accurate assessment of evidence

66.3. Real Example

Writing "We hypothesized that effect X would occur in subgroup Y" when you actually tested 20 subgroups and reported only the one that was significant.

66.4. Prevention

  • Pre-register hypotheses before data collection
  • Clearly label exploratory analyses
  • Be honest about the discovery process

67. 11.3 Confusing Statistical and Practical Significance

67.1. The Problem

Statistical significance (p < 0.05) only means that data this extreme would be unlikely if the true effect were exactly zero. It says nothing about how large the effect is.

Practical significance means the effect is large enough to matter.

67.2. Real Example

A study with n = 100,000 found that a new teaching method improved test scores by 0.1 points (out of 100), p < 0.001.

Statistically significant? Yes! Practically significant? Absolutely not!
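
The arithmetic behind this example can be sketched in a few lines. The text gives only n and the 0.1-point improvement, so the sketch assumes (hypothetically) a pre/post design in which each student's gain has a standard deviation of 5 points, and it uses a large-sample z-test.

```python
from math import sqrt
from statistics import NormalDist

n = 100_000        # from the example above
mean_gain = 0.1    # points out of 100, from the example above
sd_gain = 5.0      # ASSUMPTION: SD of individual gains, not given in the text

se = sd_gain / sqrt(n)                       # standard error of the mean gain
z = mean_gain / se
p = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value
d = mean_gain / sd_gain                      # standardized effect size
ci_low, ci_high = mean_gain - 1.96 * se, mean_gain + 1.96 * se

print(f"z = {z:.2f}, p = {p:.1e}")
print(f"d = {d:.3f}")
print(f"95% CI for the gain: [{ci_low:.3f}, {ci_high:.3f}] points")
```

With these assumed numbers the p-value is far below 0.001, yet d is about 0.02, a tenth of the conventional "small" benchmark of 0.2. The huge sample makes the estimate precise; it does not make the effect important.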

67.3. Prevention

  • Always report and interpret effect sizes
  • Calculate confidence intervals around effect sizes
  • Ask: "Would this effect change any decisions?"

68. 11.4 Violating Independence Assumptions

68.1. The Problem

Using tests that assume independence when observations are not independent.

68.2. Common Violations

| Situation | Problem | Solution |
|---|---|---|
| Students in classrooms | Students in same class are correlated | Mixed-effects models |
| Repeated measures treated as independent | Same person's data points are correlated | Repeated measures ANOVA |
| Time series data | Adjacent time points are correlated | Time series analysis |

68.3. Real Example

Testing whether a teaching intervention worked by treating each quiz score as independent, when actually the same 30 students took 10 quizzes each (n = 300 is actually n = 30!).
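
How far n = 300 shrinks depends on how strongly each student's scores resemble one another. A rough sketch using the Kish design-effect approximation, n_eff = N / (1 + (m - 1) × ICC), is shown below. The intraclass correlation (ICC) values are hypothetical, and the approximation assumes equal cluster sizes and an analysis of a simple mean.

```python
def effective_sample_size(n_total, cluster_size, icc):
    """Kish design-effect approximation for equally sized clusters."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect

# 30 students x 10 quizzes = 300 scores; the ICC values are hypothetical.
for icc in (0.0, 0.3, 0.6, 1.0):
    n_eff = effective_sample_size(300, 10, icc)
    print(f"ICC = {icc:.1f} -> effective n = {n_eff:.0f}")
```

"n = 30" is the limiting case where a student's quiz scores are perfectly correlated (ICC = 1). With partial correlation the effective sample size falls between 30 and 300, which is why a mixed-effects model that estimates the correlation is preferable to either extreme.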


69. 11.5 Post-Hoc Power Analysis

69.1. The Problem

Calculating power AFTER finding a non-significant result to argue "we just needed more participants."

69.2. Why It's Meaningless

Post-hoc ("observed") power is mathematically determined by the p-value: for a given test and α, each p-value maps to exactly one power value. If p = 0.05, post-hoc power ≈ 50%. Always. The calculation adds no new information.
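
The one-to-one mapping is easy to verify. The sketch below assumes a two-sided z-test and plugs the observed effect in as if it were the true effect, which is what post-hoc power calculations do. The function name is illustrative.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, treating the observed effect as the true effect."""
    norm = NormalDist()
    z_obs = norm.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability that a replication's |z| exceeds z_crit when the true shift equals z_obs.
    return (1 - norm.cdf(z_crit - z_obs)) + norm.cdf(-z_crit - z_obs)

for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

Notice that the function takes only the p-value and α as inputs. Sample size and effect size never enter separately, so a non-significant result will always come with low observed power.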

69.3. What To Do Instead

  • Report effect sizes and confidence intervals
  • Acknowledge limitations of sample size
  • Plan power analysis BEFORE data collection

70. 11.6 Dichotomizing Continuous Variables

70.1. The Problem

Splitting continuous variables into high/low groups (e.g., median split).

70.2. Why It's Bad

  • Throws away information
  • Reduces statistical power
  • Can create spurious effects
  • Treats "just above median" and "far above median" as identical

70.3. Real Example

Splitting participants into "high anxiety" and "low anxiety" based on median score, then comparing groups. Someone who scored 49 (low) is treated as fundamentally different from someone who scored 51 (high), even though they're essentially the same.
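
A small simulation sketch shows the information loss. The data are entirely hypothetical: anxiety scores are drawn from a normal distribution and the outcome depends linearly on them plus noise. The same relationship is then measured twice, once with the continuous score and once after a median split.

```python
import random
from math import sqrt
from statistics import mean, median

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(42)
anxiety = [rng.gauss(50, 10) for _ in range(5000)]         # hypothetical scores
outcome = [0.05 * a + rng.gauss(0, 1) for a in anxiety]    # hypothetical linear effect

cut = median(anxiety)
high_anxiety = [1.0 if a > cut else 0.0 for a in anxiety]  # median split

r_continuous = pearson_r(anxiety, outcome)
r_split = pearson_r(high_anxiety, outcome)
print(f"continuous predictor: r = {r_continuous:.3f}")
print(f"median split:         r = {r_split:.3f}")
```

For a normally distributed predictor, a median split is expected to shrink the correlation to roughly 80% of its original value, which is comparable to discarding about a third of the sample.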

70.4. What To Do Instead

  • Keep variables continuous when possible
  • Use regression instead of group comparisons
  • If you must categorize, use established clinical cutoffs

71. 11.7 Ignoring Multiple Comparisons

71.1. The Problem

Running many tests without correcting for inflated false positive rate.

71.2. The Math

If you run 20 independent tests at α = 0.05 and every null hypothesis is true:

  • Probability of at least one false positive = 1 - (0.95)²⁰ ≈ 64%!
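
The same calculation for other numbers of tests, as a minimal sketch (it assumes the tests are independent and that every null hypothesis is true):

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests -> {familywise_error_rate(k):.1%}")
```

The rate climbs quickly: a handful of uncorrected tests is already enough to push the chance of a spurious finding well above the nominal 5%.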

71.3. Correction Methods

| Method | Approach | Use When |
|---|---|---|
| Bonferroni | Divide α by number of tests | Conservative, few tests |
| Holm-Bonferroni | Sequential rejection procedure | Better power than Bonferroni |
| False Discovery Rate (FDR) | Control proportion of false discoveries | Many tests (genomics, neuroimaging) |
| Tukey HSD | Designed for pairwise comparisons after ANOVA | ANOVA post-hoc |
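
The first three methods in the table can be sketched as adjusted p-values in plain Python. This is an illustrative implementation, not a replacement for a vetted library routine; the FDR method shown is Benjamini-Hochberg, the most common FDR procedure, and the raw p-values are hypothetical.

```python
def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices from smallest p upward
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m - 1, -1, -1):                  # walk from largest p downward
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.012, 0.035, 0.20]             # hypothetical raw p-values
for name, method in [("Bonferroni", bonferroni), ("Holm", holm), ("BH (FDR)", benjamini_hochberg)]:
    adj = method(pvals)
    n_sig = sum(a < 0.05 for a in adj)
    print(f"{name:<10} {[round(a, 4) for a in adj]} -> {n_sig} significant")
```

On this made-up example the number of rejections grows from Bonferroni to Holm to Benjamini-Hochberg, matching the power ordering in the table. Keep in mind that FDR control is a weaker guarantee than family-wise error control, so the extra rejections are not free.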

71.4. Real Example

A researcher tested 100 brain regions for activity differences and reported the 5 "significant" findings. At α = 0.05 with no correction, about 5 false positives are expected by chance alone, even if no region truly differs!


72. 11.8 Misinterpreting Non-Significant Results

72.1. The Problem

Claiming "no effect" or "no difference" when p > 0.05.

72.2. The Reality

Non-significance means:

  • "We don't have enough evidence to reject the null"
  • NOT "The null hypothesis is true"
  • NOT "There is no effect"

72.3. What Affects Power to Detect Effects

  • Sample size (too small?)
  • Effect size (too small to detect?)
  • Variance (too noisy?)
  • Measurement (too imprecise?)

72.4. What To Do Instead

  • Report effect sizes and confidence intervals
  • Discuss power limitations
  • Consider equivalence testing (TOST) to actually support "no difference"
  • Use Bayesian methods to quantify evidence for null
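
As a rough illustration of equivalence testing, here is a large-sample sketch of TOST (two one-sided tests). It uses a normal (z) approximation for simplicity, whereas real analyses typically use t-based versions from a statistics package. The observed difference, standard errors, and equivalence bounds are all hypothetical.

```python
from statistics import NormalDist

def tost_p_value(diff, se, lower, upper):
    """Large-sample (z) TOST: returns the larger of the two one-sided p-values."""
    norm = NormalDist()
    p_above_lower = 1 - norm.cdf((diff - lower) / se)   # H0: true difference <= lower
    p_below_upper = norm.cdf((diff - upper) / se)       # H0: true difference >= upper
    return max(p_above_lower, p_below_upper)

# Hypothetical numbers: observed difference 0.1, equivalence bounds of +/- 2 points.
precise = tost_p_value(0.1, se=0.5, lower=-2, upper=2)
noisy = tost_p_value(0.1, se=1.5, lower=-2, upper=2)
print(f"precise study: TOST p = {precise:.5f}")
print(f"noisy study:   TOST p = {noisy:.3f}")
```

Both hypothetical studies would give a non-significant result in an ordinary test of "difference = 0", but only the precise one can claim equivalence within ±2 points. The noisy study is simply inconclusive.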

73. 11.9 The Sin Summary Checklist

Before submitting your paper, check:

| Sin | Check |
|---|---|
| □ P-hacking | Did I pre-register? Did I report all analyses? |
| □ HARKing | Am I honestly distinguishing confirmatory from exploratory? |
| □ Significance confusion | Did I report and interpret effect sizes? |
| □ Independence violation | Are my observations truly independent? |
| □ Post-hoc power | Did I avoid this meaningless calculation? |
| □ Dichotomization | Did I keep continuous variables continuous? |
| □ Multiple comparisons | Did I correct for multiple tests? |
| □ Non-significance claims | Did I avoid claiming "no effect" from p > 0.05? |

Chapter 12: Quick Reference & Cheat Sheets

74. 12.1 Test Selection Cheat Sheet

74.1. By Research Question

| Research Question | Parametric Test | Non-Parametric Alternative |
|---|---|---|
| Is sample mean different from known value? | One-sample t-test | Wilcoxon signed-rank (one-sample) |
| Do two independent groups differ? | Independent t-test | Mann-Whitney U |
| Do paired observations differ? | Paired t-test | Wilcoxon signed-rank |
| Do 3+ independent groups differ? | One-way ANOVA | Kruskal-Wallis |
| Do 3+ related conditions differ? | Repeated measures ANOVA | Friedman test |
| Are two continuous variables related? | Pearson correlation | Spearman correlation |
| Are two categorical variables related? | Chi-square test | Fisher's exact test |
| Did proportions change (paired binary)? | n/a | McNemar's test |

74.2. By Data Type

| DV Type | IV Type | Groups | Parametric | Non-Parametric |
|---|---|---|---|---|
| Continuous | n/a | 1 vs. value | One-sample t | Wilcoxon (1-samp) |
| Continuous | Categorical | 2 independent | Independent t | Mann-Whitney U |
| Continuous | Categorical | 2 paired | Paired t | Wilcoxon signed |
| Continuous | Categorical | 3+ independent | One-way ANOVA | Kruskal-Wallis |
| Continuous | Categorical | 3+ paired | RM ANOVA | Friedman |
| Continuous | Continuous | n/a | Pearson r | Spearman ρ |
| Categorical | Categorical | Independent | Chi-square | Fisher's exact |
| Binary | Before/After | Paired | n/a | McNemar |

75. 12.2 Assumption Checking Cheat Sheet

| Assumption | How to Check | What If Violated? |
|---|---|---|
| Normality | Shapiro-Wilk test, Q-Q plot | Use non-parametric or transform |
| Homogeneity of variance | Levene's test | Use Welch's correction |
| Sphericity | Mauchly's test | Use G-G or H-F correction |
| Independence | Study design | Use appropriate repeated measures test |
| Linearity | Scatter plot | Use Spearman or transform |
| No extreme outliers | Boxplots, Z-scores > 3 | Remove/Winsorize or use robust methods |

76. 12.3 Effect Size Cheat Sheet

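| Test | Effect Size | Small | Medium | Large |
|---|---|---|---|---|
| t-tests | Cohen's d | 0.2 | 0.5 | 0.8 |
| ANOVA | η² or η²p | 0.01 | 0.06 | 0.14 |
| Correlation | r | 0.1 | 0.3 | 0.5 |
| Chi-square | Cramér's V | 0.1 | 0.3 | 0.5 |
| Non-parametric | r (from Z) | 0.1 | 0.3 | 0.5 |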

77. 12.4 Reporting Statistics Cheat Sheet

77.1. Format Template

| Test | Report Format |
|---|---|
| t-test | t(df) = X.XX, p = .XXX, d = X.XX |
| ANOVA | F(df₁, df₂) = X.XX, p = .XXX, η²p = .XX |
| Chi-square | χ²(df) = X.XX, p = .XXX, V = .XX |
| Correlation | r(df) = .XX, p = .XXX |
| Mann-Whitney | U = XXX, p = .XXX, r = .XX |
| Wilcoxon | W = XXX, p = .XXX, r = .XX |
| Kruskal-Wallis | H(df) = X.XX, p = .XXX |
| Friedman | χ²(df) = X.XX, p = .XXX |

77.2. Rules

  • Report exact p-values (p = .023), not just p < .05
  • Round to 2-3 decimal places
  • Always include effect size
  • Include degrees of freedom
  • Report means and SDs for described groups
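
A small helper can keep reports consistent with the t-test template above. This is only a sketch: the function names are made up, the input values are hypothetical, and the "p < .001" floor for very small p-values is a common convention assumed here rather than one of the rules listed above.

```python
def format_p(p):
    """p-value with no leading zero and three decimals; very small values become p < .001."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def report_t(t, df, p, d):
    """Fill in the t-test template: t(df) = X.XX, p = .XXX, d = X.XX."""
    return f"t({df}) = {t:.2f}, {format_p(p)}, d = {d:.2f}"

print(report_t(2.314, 28, 0.0283, 0.87))
# prints: t(28) = 2.31, p = .028, d = 0.87
```

Generating the string from the raw numbers, instead of retyping them, also removes one source of transcription errors between the analysis output and the manuscript.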

78. 12.5 Distribution Quick Reference

| Distribution | Shape | Parameters | Use For |
|---|---|---|---|
| Normal | Symmetric bell | μ (mean), σ (SD) | Continuous measurements |
| t | Bell with heavy tails | df | Small sample means |
| Chi-square | Right-skewed | df | Categorical tests |
| F | Right-skewed | df₁, df₂ | Variance ratios (ANOVA) |
| Binomial | Discrete, symmetric-ish | n, p | Count of successes |
| Poisson | Discrete, right-skewed | λ | Event counts |

79. 12.6 Critical Values Quick Reference

79.1. t-Distribution (Two-Tailed, α = 0.05)

| df | Critical t |
|---|---|
| 5 | 2.571 |
| 10 | 2.228 |
| 15 | 2.131 |
| 20 | 2.086 |
| 30 | 2.042 |
| 60 | 2.000 |
| 120 | 1.980 |
| ∞ | 1.960 |

79.2. Chi-Square (α = 0.05)

| df | Critical χ² |
|---|---|
| 1 | 3.841 |
| 2 | 5.991 |
| 3 | 7.815 |
| 4 | 9.488 |
| 5 | 11.070 |
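
A few of these entries have closed forms, so they can be sanity-checked with the Python standard library alone. The sketch relies on three standard facts: the t-distribution with infinite df is the standard normal, a chi-square with 1 df is a squared standard normal, and a chi-square with 2 df is an exponential distribution with mean 2. The other rows need a statistics library; if SciPy is available, `scipy.stats.t.ppf` and `scipy.stats.chi2.ppf` are the usual functions for this.

```python
from math import log
from statistics import NormalDist

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # t critical value as df -> infinity
chi2_df1 = z_crit ** 2                         # chi-square, df = 1
chi2_df2 = -2 * log(alpha)                     # chi-square, df = 2

print(f"t (df = inf): {z_crit:.3f}")
print(f"chi-square (df = 1): {chi2_df1:.3f}")
print(f"chi-square (df = 2): {chi2_df2:.3f}")
```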

80. 12.7 Sample Size Rules of Thumb

| Test | Minimum per Group | Recommended |
|---|---|---|
| t-test | 12 | 20-30 |
| ANOVA | 12 | 20 |
| Correlation | 20 | 50+ |
| Chi-square | Expected ≥ 5 | 20+ per cell |
| Regression | 10-15 per predictor | 20 per predictor |
| Factor analysis | 5 per variable | 10+ per variable |

Note: These are minimums. Always do a proper power analysis for your specific situation!


81. 12.8 Decision Tree Summary (All-in-One)

[Figure: All-in-one decision tree]


82. 12.9 Master Formula Sheet

82.1. Descriptive Statistics

$$\bar{X} = \frac{\sum X_i}{n} \qquad s = \sqrt{\frac{\sum(X_i - \bar{X})^2}{n-1}} \qquad CV = \frac{s}{\bar{X}} \times 100\%$$
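
The three descriptive formulas map directly onto Python's `statistics` module. A minimal sketch with hypothetical scores:

```python
from statistics import mean, stdev

scores = [72, 85, 90, 66, 78, 95, 81]   # hypothetical data

x_bar = mean(scores)
s = stdev(scores)            # sample SD, with n - 1 in the denominator
cv = s / x_bar * 100         # coefficient of variation, in percent

print(f"mean = {x_bar:.2f}, s = {s:.2f}, CV = {cv:.1f}%")
```

Note that `stdev` uses the n - 1 denominator shown in the formula, while `pstdev` divides by n and is meant for a full population.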

82.2. Test Statistics

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \qquad F = \frac{MS_{between}}{MS_{within}}$$

$$\chi^2 = \sum\frac{(O-E)^2}{E} \qquad r = \frac{\sum(X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum(X-\bar{X})^2\sum(Y-\bar{Y})^2}}$$
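
To show how these test statistics are computed from raw data, here is a standard-library sketch of four of them (t, F, χ², and r). The function names and the tiny data sets are illustrative. The F function assumes a one-way between-subjects design, where MS = SS / df with df of k - 1 between groups and N - k within groups.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))

def f_statistic(groups):
    """One-way ANOVA F: mean square between groups over mean square within groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = len(groups) - 1, len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

print(one_sample_t([72, 85, 90, 66, 78, 95, 81], mu0=75))
print(f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
print(chi_square([30, 20], [25, 25]))
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```

These functions return only the statistic. Turning it into a p-value requires the matching reference distribution (t, F, or chi-square), which is where a statistics package comes in.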

82.3. Effect Sizes

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}} \qquad \eta^2 = \frac{SS_{effect}}{SS_{total}} \qquad r^2 = \frac{\text{variance explained}}{\text{total variance}}$$
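
A short sketch of the first two effect sizes follows. The formula sheet does not spell out s_pooled, so the code assumes the usual definition: the square root of the df-weighted average of the two sample variances. The function names and data are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    # ASSUMPTION: the usual pooled variance, weighting each group by its df.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

def eta_squared(groups):
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_total = sum((x - grand) ** 2 for x in all_values)
    ss_effect = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_effect / ss_total

print(cohens_d([5, 6, 7], [2, 3, 4]))
print(eta_squared([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

The toy groups are deliberately extreme so the arithmetic is easy to follow by hand: the two groups differ by 3 pooled SDs, and group membership accounts for 26 of the 32 total sum-of-squares units.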


[Figure: Master flowchart]


Conclusion: The Path Forward

You've now traversed the landscape of statistical testing. Here's what to remember:

  1. Always start with your research question: What are you trying to learn?

  2. Know your data: What type? What distribution? What assumptions can you make?

  3. Use the decision trees: Don't memorize; navigate.

  4. Check assumptions first: Before running any parametric test.

  5. Report effect sizes: P-values alone are not enough.

  6. Avoid the statistical sins: Pre-register, report everything, don't p-hack.

  7. When in doubt, use non-parametric: They're more robust, just less powerful.

  8. Consider Bayesian alternatives: Especially when you want to support the null.

Statistics is not about finding "significance"; it's about quantifying evidence to answer meaningful questions. The best analysis is one that honestly addresses your research question, acknowledges its limitations, and guides future inquiry.

Good luck with your research!


Last updated: Jan 2026

This guide is provided for educational purposes. Always consult with a statistician for complex analyses or when stakes are high.


Appendix A: Glossary of Terms

| Term | Definition |
|---|---|
| α (alpha) | Significance level; probability of Type I error |
| β (beta) | Probability of Type II error |
| Confidence Interval | Range of plausible values for a population parameter, built by a procedure that captures the true value in a stated share (e.g., 95%) of repeated samples |
| Degrees of Freedom (df) | Number of independent values that can vary |
| Effect Size | Magnitude of an effect, independent of sample size |
| H₀ (null hypothesis) | Statement of no effect/difference |
| H₁ (alternative hypothesis) | Statement that there is an effect/difference |
| Homogeneity of variance | Equal variances across groups |
| Non-parametric | Tests that don't assume specific distributions |
| Normal distribution | Symmetric bell-shaped distribution |
| p-value | Probability of results at least this extreme if H₀ is true |
| Parametric | Tests that assume specific distributions |
| Power | Probability of correctly rejecting a false H₀ |
| Sphericity | Equal variances of differences (repeated measures) |
| Type I Error | False positive (rejecting true H₀) |
| Type II Error | False negative (failing to reject false H₀) |

Appendix B: Recommended Reading

Introductory:

  • Field, A. - Discovering Statistics Using IBM SPSS Statistics
  • Navarro, D. - Learning Statistics with R (free online)

Intermediate:

  • Cohen, J. - Statistical Power Analysis for the Behavioral Sciences
  • Cumming, G. - Understanding the New Statistics

Advanced:

  • Gelman, A. & Hill, J. - Data Analysis Using Regression and Multilevel/Hierarchical Models
  • McElreath, R. - Statistical Rethinking (Bayesian)

End of Guide

Thank you for reading. May your p-values be meaningful and your effect sizes be large!