Peaks Over Thresholds


Lecture 20

April 14, 2025

Review

Extreme Values

Values with a very low probability of occurring; not necessarily high-impact events (which don't have to be rare!).

  1. “Block” extremes, e.g. annual maxima (block maxima)
  2. Values which exceed a certain threshold (peaks over threshold)

Generalized Extreme Values

Block maxima \(Y_n = \max \{X_1, \ldots, X_n \}\) are modeled using Generalized Extreme Value distributions: \(GEV(\mu, \sigma, \xi)\).

Code
# assumes StatsPlots, Distributions, and LaTeXStrings are loaded
p1 = plot(-2:0.1:6, GeneralizedExtremeValue(0, 1, 0.5), linewidth=3, color=:red, label=L"$\xi = 1/2$")
plot!(-4:0.1:6, GeneralizedExtremeValue(0, 1, 0), linewidth=3, color=:green, label=L"$\xi = 0$")
plot!(-4:0.1:2, GeneralizedExtremeValue(0, 1, -0.5), linewidth=3, color=:blue, label=L"$\xi = -1/2$")
scatter!((-2, 0), color=:red, label=false)  # lower bound of support for ξ = 1/2
scatter!((2, 0), color=:blue, label=false)  # upper bound of support for ξ = -1/2
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))
Figure 1: Shape of the GEV distribution with different choices of \(\xi\).

Return Levels

Return Levels are a central (if poorly named) concept in risk analysis.

The \(T\)-year return level is the value expected to be observed on average once every \(T\) years.

From a GEV fit to annual maxima, the \(T\)-year return level is the \(1 - 1/T\) quantile.
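As a minimal sketch (the GEV parameter values here are hypothetical, assuming Distributions is loaded), the return level is just a quantile:

Code
# 100-year return level from a GEV fit to annual maxima
# (parameter values below are hypothetical, for illustration only)
gev_fit = GeneralizedExtremeValue(1.0, 0.1, 0.05) # μ, σ, ξ
T = 100
rl = quantile(gev_fit, 1 - 1 / T) # T-year return level = (1 - 1/T) quantile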

Return Periods

The return period of an extreme value is the inverse of the exceedance probability.

Example: A value with a 100-year return period has an annual exceedance probability of 1%, i.e. it is the 0.99 quantile.

Return levels are associated with the corresponding return period.

Peaks Over Thresholds

Drawbacks of Block Maxima

The block-maxima approach has two potential drawbacks:

  1. Uses a limited amount of data;
  2. Doesn’t capture the potential for multiple extremes within a block.

Peaks Over Thresholds

Consider the conditional excess distribution function

\[F_u(y) = \mathbb{P}(X - u \leq y | X > u)\]

Figure 2: Illustration of the Conditional Excess Function
Figure 3: Histogram of conditional excesses
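As a minimal sketch, the empirical analogue of \(F_u\) comes from the threshold excesses; `x` and `u` below are hypothetical placeholders:

Code
# empirical threshold excesses: samples of X - u given X > u
x = randn(10_000) # stand-in data; replace with observations
u = 2.0           # threshold
excesses = x[x .> u] .- u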

Generalized Pareto Distribution (GPD)

For a large class of underlying distributions of \(X\), as the threshold \(u\) increases, \(F_u(y)\) is well-approximated by a Generalized Pareto Distribution (GPD):

\[F_u(y) \to G(y) = 1 - \left[1 + \xi\left(\frac{y-\mu}{\sigma}\right)\right]^{-1/\xi},\] defined for \(y\) such that \(1 + \xi(y-\mu)/\sigma > 0\).

Generalized Pareto Distribution (GPD)

Like the GEV distribution, the GPD has three parameters:

  • location \(\mu\);
  • scale \(\sigma > 0\);
  • shape \(\xi\).

GPD Types

  • \(\xi > 0\): heavy-tailed
  • \(\xi = 0\): light-tailed
  • \(\xi < 0\): bounded
Code
p1 = plot(0:0.1:6, GeneralizedPareto(0, 1, 0.5), linewidth=3, color=:red, label=L"$\xi = 1/2$", left_margin=5mm, bottom_margin=10mm)
plot!(0:0.1:6, GeneralizedPareto(0, 1, 0), linewidth=3, color=:green, label=L"$\xi = 0$")
plot!(0:0.1:2, GeneralizedPareto(0, 1, -0.5), linewidth=3, color=:blue, label=L"$\xi = -1/2$")
scatter!((2, 0), color=:blue, label=false) # upper bound of support μ - σ/ξ = 2 for ξ = -1/2
ylabel!("Density")
xlabel!(L"$x$")
plot!(size=(600, 450))
Figure 4: Shape of the GPD distribution with different choices of \(\xi\).

Exceedances Can Occur In Clusters

Code
# load the SF tide gauge data and clean it up
# assumes CSV, DataFrames, DataFramesMeta, Dates, and Statistics are loaded
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # combine the date/time components into a single DateTime
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

d_sf = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=d_sf.datetime[ma_offset+1:end-ma_offset], residual=d_sf.gauge[ma_offset+1:end-ma_offset] .- moving_average(d_sf.gauge, ma_offset))

# drop rows with missing data
dat_ma = dropmissing(dat_ma)


thresh = 1.0
dat_ma_plot = @subset(dat_ma, year.(:datetime) .> 2020)
dat_ma_plot.residual = dat_ma_plot.residual ./ 1000
p1 = plot(dat_ma_plot.datetime, dat_ma_plot.residual; linewidth=2, ylabel="Gauge Weather Variability (m)", label="Observations", legend=:bottom, xlabel="Date/Time", right_margin=10mm, left_margin=5mm, bottom_margin=5mm)
hline!([thresh], color=:red, linestyle=:dash, label="Threshold")
scatter!(dat_ma_plot.datetime[dat_ma_plot.residual .> thresh], dat_ma_plot.residual[dat_ma_plot.residual .> thresh], markershape=:x, color=:black, markersize=3, label="Exceedances")
Figure 5: Peaks Over Thresholds for the SF Tide Gauge Data

Declustering

Arns et al. (2013) note that there is no clear declustering time period to use: we need to rely on physical understanding of the relevant events and their "typical" durations.

If we have prior knowledge about the duration of the physical processes leading to clustered extremes (e.g. storm durations), we can use it. Otherwise, we need some way to estimate the cluster duration from the data.

Extremal Index

The most common approach uses the extremal index \(\theta(u)\), which measures the degree of clustering of exceedances of a given threshold \(u\).

\[0 \leq \theta(u) \leq 1,\]

where \(\theta(u) = 1\) means independence and \(\theta(u) = 0\) means the entire dataset is one cluster.

Extremal Index

\(\theta(u)\) has two meanings:

  1. The “propensity to cluster”: \(\theta\) is the probability that the process has left one exceedance cluster;
  2. The “reciprocal of the cluster duration”: \(1/\theta\) is the mean duration of an exceedance cluster.

Computing the Extremal Index

This estimator is taken from Ferro & Segers (2003).

Let \(N = \sum_{i=1}^n \mathbb{I}(X_i > u)\) be the total number of exceedances.

Denote by \(1 \leq S_1 < \ldots < S_N \leq n\) the exceedance times.

Then the inter-exceedance times are \[T_i = S_{i+1} - S_i, \quad 1 \leq i \leq N-1.\]

Computing the Extremal Index

\[\hat{\theta}(u) = \frac{2\left(\sum_{i=1}^{N-1} T_i\right)^2}{(N-1)\sum_{i=1}^{N-1}T_i^2}\]

Code
# find total number of exceedances and exceedance times
dat_ma.residual = dat_ma.residual ./ 1000 # convert to m
S = findall(dat_ma.residual .> thresh)
N = length(S)
T = diff(S) # get difference between adjacent exceedances
θ = 2 * sum(T)^2 / ((N-1) * sum(T.^2)) # extremal index

For the SF tide gauge data and \(u = 1.0\ \text{m}\), we get an extremal index of \(\theta \approx 0.24\), which corresponds to a declustering time of \(1/\theta \approx 4\) hours.

Mapping Data To Clusters

Code
# cluster data points which occur within `period` of each other
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

# cluster exceedances that occur within a four-hour window
# @transform is a macro from DataFramesMeta.jl which adds a new column based on a data transformation
dat_exceed = dat_ma[dat_ma.residual .> thresh, :]
dat_exceed = @transform dat_exceed :cluster = assign_cluster(:datetime, Dates.Hour(4))
# find maximum value within cluster
dat_decluster = combine(dat_exceed -> dat_exceed[argmax(dat_exceed.residual), :], 
    groupby(dat_exceed, :cluster))
dat_decluster

Declustered Distribution

Code
p = histogram(dat_decluster.residual .- thresh,
    normalize = :pdf,
    label="Data",
    xlabel="Threshold Exceedance (m)",
    ylabel="PDF",
    yticks=[],
    left_margin=10mm, 
    right_margin=10mm,
    bottom_margin=5mm
    )
Figure 6: Histogram of clustered exceedances for SF tide gauge data.

GPD Fit

Code
# fit GPD to the declustered exceedances by maximum likelihood
init_θ = [1.0, 1.0]   # initial values for (σ, ξ)
low_bds = [0.0, -Inf] # lower bounds: σ ≥ 0, ξ unbounded
up_bds = [Inf, Inf]
# negative log-likelihood; location is fixed at 0 since we model excesses over thresh
gpd_lik(θ) = -sum(logpdf(GeneralizedPareto(0.0, θ[1], θ[2]), dat_decluster.residual .- thresh))
θ_mle = Optim.optimize(gpd_lik, low_bds, up_bds, init_θ).minimizer
p1 = plot!(p, GeneralizedPareto(0.0, θ_mle[1], θ_mle[2]), linewidth=3, label="GPD Fit")
plot!(size=(600, 450))

# Q-Q Plot
p2 = qqplot(GeneralizedPareto(0.0, θ_mle[1], θ_mle[2]), dat_decluster.residual .- thresh, 
    xlabel="Theoretical Quantile",
    ylabel="Empirical Quantile",
    linewidth=3,
    left_margin=5mm, 
    right_margin=10mm,
    bottom_margin=5mm)
plot!(size=(600, 450))

display(p1)
display(p2)
Figure 7: (a) GPD fit to declustered exceedances over 1 m of the San Francisco tide gauge data; (b) Q-Q plot of the GPD fit.

But What About Exceedance Frequency?

The GPD fit gives a distribution for how extreme threshold exceedances are when they occur.

But how often do they occur?

Code
# add column with years of occurrence
dat_decluster = @transform dat_decluster :year = Dates.year.(:datetime)
# group by year and add up occurrences
exceed_counts = combine(groupby(dat_decluster, :year), nrow => :count)
deleteat!(exceed_counts, nrow(exceed_counts)) # drop the last year (2023), which would bias the count estimate
p = histogram(exceed_counts.count, legend=:false, 
    xlabel="Yearly Exceedances",
    ylabel="Count",
    left_margin=5mm,
    bottom_margin=10mm
)
plot!(size=(600, 400))
Figure 8: Histogram of the number of exceedances in each year.

Poisson - Generalized Pareto Process

Model the number of exceedances in each year with a Poisson distribution:

\[n \sim \text{Poisson}(\lambda_u).\]

The MLE for \(\lambda_u\) is the mean of the count data, in this case 49.5.

Then, for each \(i=1, \ldots, n\), sample \[X_i \sim \text{GeneralizedPareto}(u, \sigma, \xi).\]
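A minimal simulation sketch of this two-stage process, using the parameter values reported later in this lecture (and assuming Distributions is loaded):

Code
# simulate one "year" of the Poisson-GP process
λ_u, u, σ, ξ = 50.0, 1.0, 0.11, -0.17   # values from the SF example below
n = rand(Poisson(λ_u))                  # number of exceedances this year
X = rand(GeneralizedPareto(u, σ, ξ), n) # exceedance values above u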

Poisson - Generalized Pareto Process Return Levels

Then the return level for return period \(m\) years can be obtained by solving the quantile equation (see Coles (2001) for details):

\[\text{RL}_m = \begin{cases}u + \frac{\sigma}{\xi} \left((m\lambda_u)^\xi - 1\right) & \text{if}\ \xi \neq 0 \\ u + \sigma \log(m\lambda_u) & \text{if}\ \xi = 0.\end{cases}\]
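A direct translation of this formula into Julia (a sketch; `return_level` is not part of the lecture code):

Code
# m-year return level for the Poisson-GP process, following the case split above
function return_level(m, λ_u, u, σ, ξ)
    if iszero(ξ)
        return u + σ * log(m * λ_u)
    else
        return u + (σ / ξ) * ((m * λ_u)^ξ - 1)
    end
end

return_level(100, 50.0, 1.0, 0.11, -0.17) # ≈ 1.5 m with the SF values below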

Poisson-GPP Example

Question: What is the 100-year return period of the SF data?

Code
count_dist = fit(Poisson, exceed_counts.count)
λ = params(count_dist)[1]

p = histogram(exceed_counts.count, legend=:false, 
    xlabel="Yearly Exceedances",
    ylabel="Count",
    normalize=:pdf,
    left_margin=5mm,
    bottom_margin=10mm
)
plot!(size=(600, 500))
plot!(p, count_dist)
Figure 9: Histogram of number of exceedances in each year with Poisson fit

Poisson-GPP Example

Poisson: \(\lambda = 50.0\)

GPD parameters:

  • \(\mu = 0\) (fixed)
  • \(\sigma = 0.11\)
  • \(\xi = -0.17\)

\[ \begin{align*} \text{RL}_{100} &= u + \frac{\sigma}{\xi} \left((m\lambda)^\xi - 1\right) \\ &= 1 - \frac{0.11}{0.17}\left((100 \times 50)^{-0.17} - 1\right) \\ &\approx 1.5\ \text{m} \end{align*} \]

Nonstationary POT

Non-stationarity could influence either the Poisson component (how many exceedances occur each year) or the GPD component (how exceedances are distributed when they occur).

It is often simplest to use a Poisson GLM:

\[n_t \sim \text{Poisson}(\lambda_t), \qquad \lambda_t = \lambda_0 + \lambda_1 x_t,\]

and let the GPD be stationary.
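For example, with GLM.jl one could fit the identity-link Poisson regression above; this is a sketch with hypothetical counts and covariate values (note the identity link requires \(\lambda_0 + \lambda_1 x_t > 0\)):

Code
# identity-link Poisson GLM for yearly exceedance counts
# (hypothetical data; in practice use exceed_counts and a real covariate)
using DataFrames, GLM
df = DataFrame(count=[45, 52, 48, 57], x=[0.0, 0.3, 0.6, 0.9])
fit_pois = glm(@formula(count ~ x), df, Poisson(), IdentityLink())
λ0, λ1 = coef(fit_pois)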

Nonstationary Return Levels

Then the \(m\)-year return level associated with covariate \(x_t\) becomes

\[\text{RL}_m = \begin{cases}u + \frac{\sigma}{\xi} \left((m(\lambda_0 + \lambda_1 x_t))^\xi - 1\right) & \text{if}\ \xi \neq 0 \\ u + \sigma \log(m(\lambda_0 + \lambda_1 x_t)) & \text{if}\ \xi = 0.\end{cases}\]
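This just plugs the covariate-dependent rate into the stationary formula; a sketch reusing the `return_level` function above (coefficient values hypothetical):

Code
# nonstationary m-year return level: swap in the covariate-dependent rate
return_level_ns(m, λ0, λ1, x_t, u, σ, ξ) = return_level(m, λ0 + λ1 * x_t, u, σ, ξ)

return_level_ns(100, 45.0, 5.0, 1.0, 1.0, 0.11, -0.17) # ≈ 1.5 m when λ_t = 50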

Upcoming Schedule

Wednesday: Missing Data and E-M Algorithm

Monday: Mixture Models and Model-Based Clustering (maybe)

Next Wednesday (4/23): No Class

Assessments

HW5 released, due 5/2.

References

Arns, A., Wahl, T., Haigh, I. D., Jensen, J., & Pattiaratchi, C. (2013). Estimating extreme water level probabilities: A comparison of the direct methods and recommendations for best practise. Coast. Eng., 81, 51–66. https://doi.org/10.1016/j.coastaleng.2013.07.003
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag London.
Ferro, C. A. T., & Segers, J. (2003). Inference for clusters of extreme values. J. R. Stat. Soc. Series B Stat. Methodol., 65, 545–556. https://doi.org/10.1111/1467-9868.00401