Hypothesis Testing and Decision-Making


Lecture 02

January 27, 2025

Statistics and Decision-Making

Science as Decision-Making Under Uncertainty

Goal is to draw insights:

  • About causes and effects;
  • About interventions.

But our models are simplifications and our observations are uncertain!

XKCD 2440

Source: XKCD 2440

Data Generation Approximates Reality

Estimand Estimator Cake

Estimate Cake

Source: Richard McElreath

Bayesian (Risk-Based) Decision Analysis

Take some decision \(d(x)\) based on \(x\).

\[\overbrace{R(d(x))}^{\text{risk}} = \int_Y \overbrace{\mathcal{L}(d(x), y)}^{\text{loss function}} \overbrace{\pi(y | x)}^{\substack{\text{probability} \\ \text{of outcome}}}dy\]

Then the optimal decision is \(\hat{\alpha} = \underset{\alpha}{\operatorname{argmin}} R(\alpha)\).
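As a concrete sketch, the risk integral can be approximated numerically. The example below is hypothetical (a quadratic loss and a \(\mathcal{N}(2, 1)\) predictive density for \(y \mid x\) are assumed for illustration) and recovers the predictive mean as the optimal decision:

```julia
# Hypothetical setup: y | x ~ N(2, 1) predictive density, quadratic loss
ys = range(-4, 8; length=2001)                  # grid over outcomes y
π_y = @. exp(-(ys - 2.0)^2 / 2) / sqrt(2π)      # N(2, 1) density
L(d, y) = (d - y)^2                             # quadratic loss

# risk R(d) = ∫ L(d, y) π(y|x) dy, approximated by a Riemann sum
R(d) = sum(L.(d, ys) .* π_y) * step(ys)

# α̂ = argmin R(α), minimized over a grid of candidate decisions
ds = range(-2, 6; length=801)
d_opt = ds[argmin(R.(ds))]    # close to 2.0: quadratic loss picks the mean
```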

Pascal’s Wager as BDA

  • Loss of mistaken belief: \(+c\)
  • Loss of mistaken disbelief: \(+\infty\)
  • Loss of correct disbelief: \(-c\)
  • Loss of correct belief: \(-\infty\)

Pascal’s conclusion: “Optimal” decision is belief regardless of the assumed probability of God’s existence.

Blaise Pascal

Source: Wikipedia

Standard Parameter Estimators

What if we want to estimate a parameter \(\hat{\theta}\) from data \(x\)?

| Loss Function | \(\mathcal{L}(\hat{\theta}, \theta)\) | Optimal \(\hat{\theta}\) |
|---|---|---|
| Quadratic | \(\|\hat{\theta} - \theta\|^2\) | \(\text{Mean}(x)\) |
| Linear | \(\|\hat{\theta} - \theta\|\) | \(\text{Median}(x)\) |
| 0-1 | \(\begin{cases} 1 & \hat{\theta} \neq \theta \\ 0 & \hat{\theta} = \theta \end{cases}\) | \(\text{Mode}(x)\) |
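These correspondences can be checked empirically: for a fixed sample, a grid search over candidate estimates under each loss lands on the sample mean and median, respectively. A minimal sketch with synthetic data:

```julia
using Random, Statistics

# Empirical check: for a sample x, the mean minimizes total quadratic loss
# and the median minimizes total absolute loss.
Random.seed!(1)
x = randn(1001) .+ 3.0

quad_loss(θ) = sum((x .- θ).^2)
abs_loss(θ)  = sum(abs.(x .- θ))

# brute-force minimization over a fine grid of candidate estimates
cands = range(minimum(x), maximum(x); length=20_001)
θ_quad = cands[argmin(quad_loss.(cands))]   # close to mean(x)
θ_abs  = cands[argmin(abs_loss.(cands))]    # close to median(x)
```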

Risk-Based Analysis: The Original Statistical Decision-Making

Orbit of Ceres

Source: Wikipedia

Piazzi’s Measurements

Source: Wikipedia

Origin of Ordinary Least Squares

Gauss (1809): Risk/Bayesian argument for OLS estimator from quadratic loss.

German 10 Mark Note with Gauss

Gauss

Source: Wikipedia

Hypothesis Testing

Questions We Might Like To Answer

  • Are high water levels influenced by environmental change?
  • Does some environmental condition have an effect on water quality/etc?
  • Does a drug or treatment have some effect?

Onus probandi incumbit ei qui dicit, non ei qui negat (“the burden of proof lies on the one who asserts, not on the one who denies”)

Core assumption: Burden of proof is on someone claiming an effect (or a similar hypothesis).

Null Hypothesis Meme

Null Hypothesis Significance Testing

  • Check if the data is consistent with a “null” model;
  • If the data is unlikely under the null model (to some level of significance), this is evidence for the alternative;
  • If the data is consistent with the null, there is no need for an alternative hypothesis.

Alternative Hypothesis Meme

From Null Hypothesis to Null Model

…the null hypothesis must be exact, that is free of vagueness and ambiguity, because it must supply the basis of the ‘problem of distribution,’ of which the test of significance is the solution.

— R. A. Fisher, The Design of Experiments, 1935.

Example: High Water Nonstationarity

Code
# packages used for this analysis (DataFramesMeta provides the @chain macro)
using CSV, DataFrames, DataFramesMeta, Dates, Statistics, GLM
using Plots, Plots.PlotMeasures # PlotMeasures provides `mm` for plot margins

# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
deleteat!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18,
    left_margin=5mm, 
    bottom_margin=5mm
)

n = nrow(dat_annmax)
linfit = lm(@formula(residual ~ Year), dat_annmax)
pred = coef(linfit)[1] .+ coef(linfit)[2] * dat_annmax.Year

plot!(p1, dat_annmax.Year, pred, linewidth=3, label="Linear Trend")
Figure 1: Annual maxima surge data from the San Francisco, CA tide gauge.

The Null: Is The Trend Real?

\(\mathcal{H}_0\) (Null Hypothesis):

  • The “trend” is just due to chance, there is no long-term trend in the data.
  • Statistically:

\[y = \underbrace{b}_{\text{constant}} + \underbrace{\varepsilon}_{\text{residuals}}, \qquad \varepsilon \underbrace{\sim}_{\substack{\text{distributed} \\ {\text{according to}}}} \mathcal{N}(0, \sigma^2) \]
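Because the null is a full generative model, we can simulate from it. The sketch below uses hypothetical values for \(b\) and \(\sigma\) (in practice these would be estimated from the data) to draw synthetic records under \(\mathcal{H}_0\) and look at the resulting distribution of OLS slopes, i.e. apparent "trends" that arise purely by chance:

```julia
using Random, Statistics

Random.seed!(42)
b, σ = 1.2, 0.15        # hypothetical constant level (m) and noise sd
years = 1900:2022

# OLS slope estimator for a series y observed at times t
ols_slope(t, y) = sum((t .- mean(t)) .* (y .- mean(y))) / sum((t .- mean(t)).^2)

# sampling distribution of the slope when the null model is true:
# the simulated slopes scatter around zero even though there is no trend
slopes = [ols_slope(years, b .+ σ .* randn(length(years))) for _ in 1:10_000]
```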

An Alternative Hypothesis

\(\mathcal{H}\):

  • There is a non-zero long-term trend in the data.
  • Statistically:

\[y = a \times t + b + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Null Test

Comparing \(\mathcal{H}\) with \(\mathcal{H}_0\):

  • \(\mathcal{H}\): \(a \neq 0\)
  • \(\mathcal{H}_0\): \(a = 0\)

In this example, our null is an example of a point-null hypothesis.

Computing the Test Statistic

For this type of null hypothesis test, our test statistic is the slope of the OLS fit \[\hat{a} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.\]

Assuming the null, the sampling distribution of the statistic is \[\frac{\hat{a}}{SE_{\hat{a}}} \sim t_{n-2}.\]
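Putting the two formulas together, the test can be carried out in a few lines. The sketch below assumes the Distributions.jl package for the \(t_{n-2}\) distribution, and `slope_test` is a hypothetical helper name:

```julia
using Statistics, Distributions

# Two-sided test of H₀: a = 0, from the OLS slope and its standard error.
# t and y are the observation times and values (e.g., Year and annual max).
function slope_test(t, y)
    n = length(t)
    t̄, ȳ = mean(t), mean(y)
    Sxx = sum((t .- t̄).^2)
    â = sum((t .- t̄) .* (y .- ȳ)) / Sxx        # OLS slope
    b̂ = ȳ - â * t̄                              # OLS intercept
    resid = y .- (b̂ .+ â .* t)
    σ̂² = sum(resid.^2) / (n - 2)               # residual variance
    se = sqrt(σ̂² / Sxx)                        # standard error of the slope
    tstat = â / se
    pval = 2 * ccdf(TDist(n - 2), abs(tstat))  # two-tailed p-value
    return (slope=â, t=tstat, p=pval)
end
```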

Statistical Significance

Is the value of the test statistic consistent with the null hypothesis?

More formally, could the test statistic have been reasonably observed from a random sample given the null hypothesis?

p-Values: Quantification of “Surprise”

One-Tailed Test:

Figure 2: Illustration of a p-value

Two-Tailed Test:

Figure 3: Illustration of a two-tailed p-value

Statistical Significance

Error Types

| Decision About Null Hypothesis | Null Hypothesis Is True | Null Hypothesis Is False |
|---|---|---|
| Don’t reject | True negative (probability \(1-\alpha\)) | Type II error (probability \(\beta\)) |
| Reject | Type I error (probability \(\alpha\)) | True positive (probability \(1-\beta\)) |

p-Value and Significance

Common practice: If the p-value is sufficiently small (below \(\alpha\)), reject the null hypothesis with \(1-\alpha\) confidence, or declare that the alternative hypothesis is statistically significant at the \(1-\alpha\) level.

This can mean:

  1. The null hypothesis is not true for that data-generating process;
  2. The null hypothesis is true but the data is an outlying sample.
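Possibility 2 can be quantified: when the null is true, the p-value of an exact test is uniformly distributed, so a fraction \(\alpha\) of null samples will look "significant" purely by chance. A simulation sketch (a one-sample t-test on pure noise, assuming Distributions.jl):

```julia
using Random, Statistics, Distributions

# Under a true null, p-values are uniform on [0, 1]: even when H₀ holds,
# a fraction α of samples will be "significant" by chance alone.
Random.seed!(1)
function one_sample_p(n)
    x = randn(n)                       # data generated under H₀: μ = 0
    tstat = mean(x) / (std(x) / sqrt(n))
    return 2 * ccdf(TDist(n - 1), abs(tstat))
end

pvals = [one_sample_p(30) for _ in 1:100_000]
false_positive_rate = mean(pvals .< 0.05)   # close to 0.05
```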

What p-Values Are Not

  1. Probability that the null hypothesis is true (this is never computed);
  2. An indication of the effect size (or the stakes of that effect).

\[ \underbrace{p(S \geq \hat{S} \mid \mathcal{H}_0)}_{\text{p-value}} \neq \underbrace{p(\mathcal{H}_0 \mid S \geq \hat{S})}_{\substack{\text{probability of} \\ \text{null}}}!\]

Problems with Null Hypothesis Testing

Statistical Significance ≠ Scientific Significance

Statistical significance does not mean anything about:

  1. whether the alternative hypothesis is “true”;
  2. whether the model accurately reflects the data-generating process.

Hypothesis vs. Causal Meme

What is Any Statistical Test Doing?

  1. Assume the null hypothesis \(\mathcal{H}_0\).
  2. Compute the test statistic \(\hat{S}\) for the sample.
  3. Obtain the sampling distribution of the test statistic \(S\) under \(\mathcal{H}_0\).
  4. Calculate \(\mathbb{P}(S > \hat{S})\) (the p-value).

Why Was \(\mathcal{H}_0\) chosen?

  • Often out of convenience for the test.
  • Point-null hypotheses are almost always wrong for the social and environmental sciences.

Point-Null Hypothesis Meme

Non-Uniqueness of “Null” Models

Is there a trend in the SF tide gauge data?

  • Trend as regression (\(p \approx 0.02\))
  • Mann-Kendall test for monotonic trend (\(p \approx 0.5\))
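For reference, the Mann-Kendall statistic is simple to compute. The sketch below is a minimal implementation (normal approximation, no correction for ties); `mann_kendall` is a hypothetical helper, not a substitute for a tested library:

```julia
# Minimal Mann-Kendall trend test (normal approximation, no tie correction):
# S counts concordant minus discordant pairs; |z| > 1.96 rejects at α = 0.05.
function mann_kendall(y)
    n = length(y)
    S = sum(sign(y[j] - y[i]) for i in 1:n-1 for j in i+1:n)
    varS = n * (n - 1) * (2n + 5) / 18
    # continuity-corrected z statistic
    z = S > 0 ? (S - 1) / sqrt(varS) : S < 0 ? (S + 1) / sqrt(varS) : 0.0
    return (S=S, z=z)
end
```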

Non-Uniqueness of Null Models

Source: McElreath (2020, fig. 1.2)

Statistical Test Zoo

Zoo of Statistical Tests

Source: McElreath (2020, fig. 1.1)

Zoo of Statistical Tests

Multiple Comparisons

If you conduct multiple statistical tests, you must account for all of these in the p-value computation and assessment of significance.

Important: This includes model selection!
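The reason: with \(m\) independent tests each at level \(\alpha\), the family-wise chance of at least one false positive compounds quickly. A quick calculation (with Bonferroni shown as one standard correction):

```julia
# With m independent tests at level α, the chance of at least one false
# positive is 1 - (1 - α)^m; Bonferroni controls this by testing at α/m.
α, m = 0.05, 20
fwer = 1 - (1 - α)^m           # ≈ 0.64 for 20 tests at α = 0.05
fwer_bonf = 1 - (1 - α/m)^m    # ≤ α after Bonferroni correction
```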

Multiple Comparisons Meme

Results Are Flashy, But Meaningless Without Methods

Elton John Results Section Meme

Source: Richard McElreath

Interpretability of p-Values and Significance

  • p-values are often confused with hypothesis probabilities or Type I error rates
  • p-values are a continuous measure of “surprise” at seeing a dataset given the null, but “significance” is binary.

XKCD #1478

Source: XKCD

Practical Results of NHST

Perhaps most damningly:

The null hypothesis approach, as described here and typically practiced, has empirically failed to maintain rigor and credibility in the scientific literature (Ioannidis, 2005; Szucs & Ioannidis, 2017).

How Could Stats Do This Meme

Source: Richard McElreath

Practical Results of NHST

  • Overconfident confidence intervals;
  • Strawman null hypotheses;
  • Biased sampling;
  • Lack of replications;
  • p-hacking.

How Could Stats Do This Meme

Source: Richard McElreath

What Might Be More Satisfying?

  • Consideration of multiple plausible (possibly more nuanced) hypotheses.
  • Assessment/quantification of evidence consistent with different hypotheses.
  • Identification of opportunities to design experiments/learn.
  • Insight into the effect size.

Note: This Does Not Mean Null Hypothesis Testing Is Useless!

Examining and testing the implications of competing models is important, including “null” models!

Null Hypothesis Selection Good Vs. Bad

Key Points

Hypothesis Testing

  • Classical framework: Compare a null hypothesis (no effect) to an alternative (some effect)
  • \(p\)-value: probability (under \(\mathcal{H}_0\)) of a test statistic at least as extreme as the one observed.
  • “Significant” if \(p\)-value is below a significance level reflecting acceptable Type I error rate.

Problems with NHST framework

  • \(p\)-values are often over-interpreted and often incorrectly calculated, with negative outcomes!
  • Important: “Big” data can make things worse, as NHST will flag effects that are tiny and practically negligible as statistically significant.
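The “big data” point can be seen directly in simulation: a hypothetical true effect of 0.01 standard deviations (almost certainly practically negligible) still produces a minuscule p-value once the sample is large enough (assuming Distributions.jl):

```julia
using Random, Statistics, Distributions

# Hypothetical: a tiny true mean shift of 0.01 sd, with a very large sample.
Random.seed!(2)
n = 1_000_000
x = randn(n) .+ 0.01
tstat = mean(x) / (std(x) / sqrt(n))           # large (≈ 10 in expectation)
pval = 2 * ccdf(TDist(n - 1), abs(tstat))
# pval is astronomically small, despite the effect being negligible
```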

Upcoming Schedule

Next Classes

Wednesday: What is a generative model?

Next Week+: Prob/Stats “Review” and Fundamentals

Assessments

Homework 1 available; due next Friday (2/7).

References

References (Scroll for Full List)

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med., 2, e124. https://doi.org/10.1371/journal.pmed.0020124
McElreath, R. (2020). Statistical rethinking : A bayesian course with examples in R and Stan (Second). Boca Raton, Florida: CRC. Retrieved from https://www.routledge.com/Statistical-Rethinking-A-Bayesian-Course-with-Examples-in-R-and-STAN/McElreath/p/book/9780367139919
Szucs, D., & Ioannidis, J. P. A. (2017). When null hypothesis significance testing is unsuitable for research: A reassessment. Front. Hum. Neurosci., 11, 390. https://doi.org/10.3389/fnhum.2017.00390