Hypothesis Testing and Decision-Making


Lecture 02

January 27, 2025

Statistics and Decision-Making

Science as Decision-Making Under Uncertainty

Goal is to draw insights:

  • About causes and effects;
  • About interventions.

But our models are simplifications and our observations are uncertain!

XKCD 2440

Source: XKCD 2440

Data Generation Approximates Reality

Estimand Estimator Cake

Estimate Cake

Source: Richard McElreath

Bayesian (Risk-Based) Decision Analysis

Take some decision \(d(x)\) based on \(x\).

\[\overbrace{R(d(x))}^{\text{risk}} = \int_Y \overbrace{\mathcal{L}(d(x), y)}^{\text{loss function}} \overbrace{\pi(y | x)}^{\substack{\text{probability} \\ \text{of outcome}}}dy\]

Then the optimal decision is \(\hat{\alpha} = \underset{\alpha}{\operatorname{argmin}} R(\alpha)\).
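As a concrete sketch, the risk integral can be approximated numerically. The example below is hypothetical (a quadratic loss and a \(\mathcal{N}(2, 1)\) predictive density for \(y \mid x\) are assumed for illustration) and recovers the predictive mean as the optimal decision:

```julia
# Hypothetical setup: y | x ~ N(2, 1) predictive density, quadratic loss
ys = range(-4, 8; length=2001)                  # grid over outcomes y
π_y = @. exp(-(ys - 2.0)^2 / 2) / sqrt(2π)      # N(2, 1) density
L(d, y) = (d - y)^2                             # quadratic loss

# risk R(d) = ∫ L(d, y) π(y|x) dy, approximated by a Riemann sum
R(d) = sum(L.(d, ys) .* π_y) * step(ys)

# α̂ = argmin R(α), minimized over a grid of candidate decisions
ds = range(-2, 6; length=801)
d_opt = ds[argmin(R.(ds))]    # close to 2.0: quadratic loss picks the mean
```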

Pascal’s Wager as BDA

  • Loss of mistaken belief: \(+c\)
  • Loss of mistaken disbelief: \(+\infty\)
  • Loss of correct disbelief: \(-c\)
  • Loss of correct belief: \(-\infty\)

Pascal’s conclusion: “Optimal” decision is belief regardless of the assumed probability of God’s existence.

Blaise Pascal

Source: Wikipedia

Standard Parameter Estimators

What if we want to estimate a parameter \(\hat{\theta}\) from data \(x\)?

| Loss Function | \(\mathcal{L}(\hat{\theta}, \theta)\) | Optimal \(\hat{\theta}\) |
|---|---|---|
| Quadratic | \(\|\hat{\theta} - \theta\|^2\) | \(\text{Mean}(x)\) |
| Linear | \(\|\hat{\theta} - \theta\|\) | \(\text{Median}(x)\) |
| 0-1 | \(\begin{cases} 1 & \hat{\theta} \neq \theta \\ 0 & \hat{\theta} = \theta \end{cases}\) | \(\text{Mode}(x)\) |
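These correspondences can be checked empirically: for a fixed sample, a grid search over candidate estimates under each loss lands on the sample mean and median, respectively. A minimal sketch with synthetic data:

```julia
using Random, Statistics

# Empirical check: for a sample x, the mean minimizes total quadratic loss
# and the median minimizes total absolute loss.
Random.seed!(1)
x = randn(1001) .+ 3.0

quad_loss(θ) = sum((x .- θ).^2)
abs_loss(θ)  = sum(abs.(x .- θ))

# brute-force minimization over a fine grid of candidate estimates
cands = range(minimum(x), maximum(x); length=20_001)
θ_quad = cands[argmin(quad_loss.(cands))]   # close to mean(x)
θ_abs  = cands[argmin(abs_loss.(cands))]    # close to median(x)
```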

Risk-Based Analysis: The Original Statistical Decision-Making

Orbit of Ceres

Source: Wikipedia

Piazzi’s Measurements

Source: Wikipedia

Origin of Ordinary Least Squares

Gauss (1809): Risk/Bayesian argument for OLS estimator from quadratic loss.

German 10 Mark Note with Gauss

Gauss

Source: Wikipedia

Hypothesis Testing

Questions We Might Like To Answer

  • Are high water levels influenced by environmental change?
  • Does some environmental condition have an effect on water quality/etc?
  • Does a drug or treatment have some effect?

Onus probandi incumbit ei qui dicit, non ei qui negat (“the burden of proof lies on the one who asserts, not on the one who denies”)

Core assumption: Burden of proof is on someone claiming an effect (or a similar hypothesis).

Null Hypothesis Meme

Null Hypothesis Significance Testing

  • Check if the data is consistent with a “null” model;
  • If the data is unlikely under the null model (to some level of significance), this is evidence for the alternative;
  • If the data is consistent with the null, there is no need for an alternative hypothesis.

Alternative Hypothesis Meme

From Null Hypothesis to Null Model

…the null hypothesis must be exact, that is free of vagueness and ambiguity, because it must supply the basis of the ‘problem of distribution,’ of which the test of significance is the solution.

— R. A. Fisher, The Design of Experiments, 1935.

Example: High Water Nonstationarity

Code
# packages used for this analysis (DataFramesMeta provides the @chain macro)
using CSV, DataFrames, DataFramesMeta, Dates, Statistics, GLM
using Plots, Plots.PlotMeasures # PlotMeasures provides `mm` for plot margins

# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
deleteat!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18,
    left_margin=5mm, 
    bottom_margin=5mm
)

n = nrow(dat_annmax)
linfit = lm(@formula(residual ~ Year), dat_annmax)
pred = coef(linfit)[1] .+ coef(linfit)[2] * dat_annmax.Year

plot!(p1, dat_annmax.Year, pred, linewidth=3, label="Linear Trend")
Figure 1: Annual maxima surge data from the San Francisco, CA tide gauge.

The Null: Is The Trend Real?

\(\mathcal{H}_0\) (Null Hypothesis):

  • The “trend” is just due to chance, there is no long-term trend in the data.
  • Statistically:

\[y = \underbrace{b}_{\text{constant}} + \underbrace{\varepsilon}_{\text{residuals}}, \qquad \varepsilon \underbrace{\sim}_{\substack{\text{distributed} \\ {\text{according to}}}} \mathcal{N}(0, \sigma^2) \]
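Because the null is a full generative model, we can simulate from it. The sketch below uses hypothetical values for \(b\) and \(\sigma\) (in practice these would be estimated from the data) to draw synthetic records under \(\mathcal{H}_0\) and look at the resulting distribution of OLS slopes, i.e. apparent "trends" that arise purely by chance:

```julia
using Random, Statistics

Random.seed!(42)
b, σ = 1.2, 0.15        # hypothetical constant level (m) and noise sd
years = 1900:2022

# OLS slope estimator for a series y observed at times t
ols_slope(t, y) = sum((t .- mean(t)) .* (y .- mean(y))) / sum((t .- mean(t)).^2)

# sampling distribution of the slope when the null model is true:
# the simulated slopes scatter around zero even though there is no trend
slopes = [ols_slope(years, b .+ σ .* randn(length(years))) for _ in 1:10_000]
```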

An Alternative Hypothesis

\(\mathcal{H}\):

  • There is a non-zero long-term trend in the data.
  • Statistically:

\[y = a \times t + b + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Null Test

Comparing \(\mathcal{H}\) with \(\mathcal{H}_0\):

  • \(\mathcal{H}\): \(a \neq 0\)
  • \(\mathcal{H}_0\): \(a = 0\)

In this example, our null is an example of a point-null hypothesis.

Computing the Test Statistic

For this type of null hypothesis test, our test statistic is the slope of the OLS fit \[\hat{a} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.\]

Assuming the null, the sampling distribution of the statistic is \[\frac{\hat{a}}{SE_{\hat{a}}} \sim t_{n-2}.\]
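Putting the two formulas together, the test can be carried out in a few lines. The sketch below assumes the Distributions.jl package for the \(t_{n-2}\) distribution, and `slope_test` is a hypothetical helper name:

```julia
using Statistics, Distributions

# Two-sided test of H₀: a = 0, from the OLS slope and its standard error.
# t and y are the observation times and values (e.g., Year and annual max).
function slope_test(t, y)
    n = length(t)
    t̄, ȳ = mean(t), mean(y)
    Sxx = sum((t .- t̄).^2)
    â = sum((t .- t̄) .* (y .- ȳ)) / Sxx        # OLS slope
    b̂ = ȳ - â * t̄                              # OLS intercept
    resid = y .- (b̂ .+ â .* t)
    σ̂² = sum(resid.^2) / (n - 2)               # residual variance
    se = sqrt(σ̂² / Sxx)                        # standard error of the slope
    tstat = â / se
    pval = 2 * ccdf(TDist(n - 2), abs(tstat))  # two-tailed p-value
    return (slope=â, t=tstat, p=pval)
end
```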

Statistical Significance

Is the value of the test statistic consistent with the null hypothesis?

More formally, could the test statistic have been reasonably observed from a random sample given the null hypothesis?

p-Values: Quantification of “Surprise”

One-Tailed Test:

Figure 2: Illustration of a p-value

Two-Tailed Test:

Figure 3: Illustration of a two-tailed p-value

Statistical Significance

Error Types

| Decision About Null Hypothesis | Null Hypothesis Is True | Null Hypothesis Is False |
|---|---|---|
| Don’t reject | True negative (probability \(1-\alpha\)) | Type II error (probability \(\beta\)) |
| Reject | Type I error (probability \(\alpha\)) | True positive (probability \(1-\beta\)) |

p-Value and Significance

Common practice: If the p-value is sufficiently small (below \(\alpha\)), reject the null hypothesis with \(1-\alpha\) confidence, or declare that the alternative hypothesis is statistically significant at the \(1-\alpha\) level.

This can mean:

  1. The null hypothesis is not true for that data-generating process;
  2. The null hypothesis is true but the data is an outlying sample.
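Possibility 2 can be quantified: when the null is true, the p-value of an exact test is uniformly distributed, so a fraction \(\alpha\) of null samples will look "significant" purely by chance. A simulation sketch (a one-sample t-test on pure noise, assuming Distributions.jl):

```julia
using Random, Statistics, Distributions

# Under a true null, p-values are uniform on [0, 1]: even when H₀ holds,
# a fraction α of samples will be "significant" by chance alone.
Random.seed!(1)
function one_sample_p(n)
    x = randn(n)                       # data generated under H₀: μ = 0
    tstat = mean(x) / (std(x) / sqrt(n))
    return 2 * ccdf(TDist(n - 1), abs(tstat))
end

pvals = [one_sample_p(30) for _ in 1:100_000]
false_positive_rate = mean(pvals .< 0.05)   # close to 0.05
```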

What p-Values Are Not

  1. Probability that the null hypothesis is true (this is never computed);
  2. An indication of the effect size (or the stakes of that effect).

\[ \underbrace{p(S \geq \hat{S} \mid \mathcal{H}_0)}_{\text{p-value}} \neq \underbrace{p(\mathcal{H}_0 \mid S \geq \hat{S})}_{\substack{\text{probability of} \\ \text{null}}}!\]

Problems with Null Hypothesis Testing

Statistical Significance ≠ Scientific Significance

Statistical significance does not mean anything about:

  1. whether the alternative hypothesis is “true”;
  2. whether the model accurately reflects the data-generating process.

Hypothesis vs. Causal Meme

What is Any Statistical Test Doing?

  1. Assume the null hypothesis \(\mathcal{H}_0\).
  2. Compute the test statistic \(\hat{S}\) for the sample.
  3. Obtain the sampling distribution of the test statistic \(S\) under \(\mathcal{H}_0\).
  4. Calculate \(\mathbb{P}(S > \hat{S})\) (the p-value).

Why Was \(\mathcal{H}_0\) chosen?

  • Often out of convenience for the test.
  • Point-null hypotheses are almost always wrong for the social and environmental sciences.

Point-Null Hypothesis Meme

Non-Uniqueness of “Null” Models

Is there a trend in the SF tide gauge data?

  • Trend as regression (\(p \approx 0.02\))
  • Mann-Kendall test for monotonic trend (\(p \approx 0.5\))
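For reference, the Mann-Kendall statistic is simple to compute. The sketch below is a minimal implementation (normal approximation, no correction for ties); `mann_kendall` is a hypothetical helper, not a substitute for a tested library:

```julia
# Minimal Mann-Kendall trend test (normal approximation, no tie correction):
# S counts concordant minus discordant pairs; |z| > 1.96 rejects at α = 0.05.
function mann_kendall(y)
    n = length(y)
    S = sum(sign(y[j] - y[i]) for i in 1:n-1 for j in i+1:n)
    varS = n * (n - 1) * (2n + 5) / 18
    # continuity-corrected z statistic
    z = S > 0 ? (S - 1) / sqrt(varS) : S < 0 ? (S + 1) / sqrt(varS) : 0.0
    return (S=S, z=z)
end
```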

Non-Uniqueness of Null Models

Source: McElreath (2020, fig. 1.2)

Statistical Test Zoo

Zoo of Statistical Tests

Source: McElreath (2020, fig. 1.1)

Zoo of Statistical Tests

Multiple Comparisons

If you conduct multiple statistical tests, you must account for all of these in the p-value computation and assessment of significance.

Important: This includes model selection!
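The reason: with \(m\) independent tests each at level \(\alpha\), the family-wise chance of at least one false positive compounds quickly. A quick calculation (with Bonferroni shown as one standard correction):

```julia
# With m independent tests at level α, the chance of at least one false
# positive is 1 - (1 - α)^m; Bonferroni controls this by testing at α/m.
α, m = 0.05, 20
fwer = 1 - (1 - α)^m           # ≈ 0.64 for 20 tests at α = 0.05
fwer_bonf = 1 - (1 - α/m)^m    # ≤ α after Bonferroni correction
```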

Multiple Comparisons Meme

Results Are Flashy, But Meaningless Without Methods

Elton John Results Section Meme

Source: Richard McElreath

Interpretability of p-Values and Significance

  • p-values are often confused with hypothesis probabilities or Type I error rates
  • p-values are a continuous measure of “surprise” at seeing a dataset given the null, but “significance” is binary.

XKCD #1478

Source: XKCD

Practical Results of NHST

Perhaps most damningly:

The null hypothesis approach, as described here and typically practiced, has empirically failed to maintain rigor and credibility in the scientific literature (Ioannidis, 2005; Szucs & Ioannidis, 2017).

How Could Stats Do This Meme

Source: Richard McElreath

Practical Results of NHST

  • Overconfident confidence intervals;
  • Strawman null hypotheses;
  • Biased sampling;
  • Lack of replications;
  • p-hacking.

How Could Stats Do This Meme

Source: Richard McElreath

What Might Be More Satisfying?

  • Consideration of multiple plausible (possibly more nuanced) hypotheses.
  • Assessment/quantification of evidence consistent with different hypotheses.
  • Identification of opportunities to design experiments/learn.
  • Insight into the effect size.

Note: This Does Not Mean Null Hypothesis Testing Is Useless!

Examining and testing the implications of competing models is important, including “null” models!

Null Hypothesis Selection Good Vs. Bad

Key Points

Hypothesis Testing

  • Classical framework: Compare a null hypothesis (no effect) to an alternative (some effect)
  • \(p\)-value: probability (under \(\mathcal{H}_0\)) of a test statistic at least as extreme as the one observed.
  • “Significant” if \(p\)-value is below a significance level reflecting acceptable Type I error rate.

Problems with NHST framework

  • \(p\)-values are often over-interpreted and often incorrectly calculated, with negative outcomes!
  • Important: “Big” data can make things worse, as NHST will flag effects that are tiny and practically negligible as statistically significant.
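The “big data” point can be seen directly in simulation: a hypothetical true effect of 0.01 standard deviations (almost certainly practically negligible) still produces a minuscule p-value once the sample is large enough (assuming Distributions.jl):

```julia
using Random, Statistics, Distributions

# Hypothetical: a tiny true mean shift of 0.01 sd, with a very large sample.
Random.seed!(2)
n = 1_000_000
x = randn(n) .+ 0.01
tstat = mean(x) / (std(x) / sqrt(n))           # large (≈ 10 in expectation)
pval = 2 * ccdf(TDist(n - 1), abs(tstat))
# pval is astronomically small, despite the effect being negligible
```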

Upcoming Schedule

Next Classes

Wednesday: What is a generative model?

Next Week+: Prob/Stats “Review” and Fundamentals

Assessments

Homework 1 available; due next Friday (2/7).

References

References (Scroll for Full List)

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med., 2, e124. https://doi.org/10.1371/journal.pmed.0020124
McElreath, R. (2020). Statistical rethinking : A bayesian course with examples in R and Stan (Second). Boca Raton, Florida: CRC. Retrieved from https://www.routledge.com/Statistical-Rethinking-A-Bayesian-Course-with-Examples-in-R-and-STAN/McElreath/p/book/9780367139919
Szucs, D., & Ioannidis, J. P. A. (2017). When null hypothesis significance testing is unsuitable for research: A reassessment. Front. Hum. Neurosci., 11, 390. https://doi.org/10.3389/fnhum.2017.00390