Figure 4: Shape of the GPD distribution with different choices of \(\xi\).
Exceedances Can Occur In Clusters
```julia
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end
d_sf = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length / 2))
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=d_sf.datetime[ma_offset+1:end-ma_offset],
                   residual=d_sf.gauge[ma_offset+1:end-ma_offset] .- moving_average(d_sf.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
thresh = 1.0
dat_ma_plot = @subset(dat_ma, year.(:datetime) .> 2020)
dat_ma_plot.residual = dat_ma_plot.residual ./ 1000
p1 = plot(dat_ma_plot.datetime, dat_ma_plot.residual; linewidth=2,
          ylabel="Gauge Weather Variability (m)", label="Observations",
          legend=:bottom, xlabel="Date/Time",
          right_margin=10mm, left_margin=5mm, bottom_margin=5mm)
hline!([thresh], color=:red, linestyle=:dash, label="Threshold")
scatter!(dat_ma_plot.datetime[dat_ma_plot.residual .> thresh],
         dat_ma_plot.residual[dat_ma_plot.residual .> thresh],
         markershape=:x, color=:black, markersize=3, label="Exceedances")
```
Figure 5: Peaks Over Thresholds for the SF Tide Gauge Data
Declustering
Arns et al. (2013) note that there is no clear declustering time period to use: we need to rely on physical understanding of events and their “typical” durations.
If we have prior knowledge about the duration of the physical processes leading to clustered extremes (e.g. storm durations), we can use this. Otherwise, we need some way to estimate the cluster duration from the data.
Extremal Index
The most common approach uses the extremal index \(\theta(u)\), which characterizes the clustering of exceedances of a given threshold \(u\).
\[0 \leq \theta(u) \leq 1,\]
where \(\theta(u) = 1\) means independence and \(\theta(u) = 0\) means the entire dataset is one cluster.
Extremal Index
\(\theta(u)\) has two meanings:
The “propensity to cluster”: \(\theta\) is the probability that the process has left one exceedance cluster;
The “reciprocal of the cluster duration”: \(1/\theta\) is the mean duration of an exceedance cluster.
Computing the Extremal Index
The following estimator is taken from Ferro & Segers (2003).
Let \(N = \sum_{i=1}^n \mathbb{I}(X_i > u)\) be the total number of exceedances.
Denote by \(1 \leq S_1 < \ldots < S_N \leq n\) the exceedance times.
Then the inter-exceedance times are \[T_i = S_{i+1} - S_i, \quad 1 \leq i \leq N-1.\]
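The intervals estimator of the extremal index is then

\[\hat{\theta}(u) = \frac{2 \left( \sum_{i=1}^{N-1} T_i \right)^2}{(N-1) \sum_{i=1}^{N-1} T_i^2}.\]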
```julia
# find total number of exceedances and exceedance times
dat_ma.residual = dat_ma.residual ./ 1000 # convert to m
S = findall(dat_ma.residual .> thresh)
N = length(S)
T = diff(S) # get difference between adjacent exceedances
θ = 2 * sum(T)^2 / ((N - 1) * sum(T .^ 2)) # extremal index
```
For the SF tide gauge data and \(u=1.0 \text{m}\), we get an extremal index of 0.24 and a declustering time of 4.0 hours.
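As a rough sketch of where the declustering time comes from, the extremal index can be converted into a window length (this assumes hourly observations, so that \(1/\theta\) is measured in hours):

```julia
# convert the extremal index to a declustering window, assuming hourly
# observations so that 1/θ is in hours; θ = 0.24 is the value computed above
θ = 0.24
decluster_time = round(Int, 1 / θ) # mean cluster duration ≈ 4 hours
```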
Mapping Data To Clusters
```julia
# cluster data points which occur within period
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

# cluster exceedances that occur within a four-hour window
# @transform is a macro from DataFramesMeta.jl which adds a new column based on a data transformation
dat_exceed = dat_ma[dat_ma.residual .> thresh, :]
dat_exceed = @transform dat_exceed :cluster = assign_cluster(:datetime, Dates.Hour(4))
# find maximum value within cluster
dat_decluster = combine(dat_exceed -> dat_exceed[argmax(dat_exceed.residual), :], groupby(dat_exceed, :cluster))
dat_decluster
```
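To see what the cluster assignment does, here is a toy run on made-up integer “times” (the helper is repeated so the snippet is self-contained; the data values are purely illustrative):

```julia
# cluster data points which occur within period (same helper as above)
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

# hypothetical exceedance times: 1-3 fall within one 2-unit window of each
# other, 10-11 within another, and 30 is isolated
times = [1, 2, 3, 10, 11, 30]
assign_cluster(times, 2) # → [1, 1, 1, 2, 2, 3]
```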
(a) GPD fit to San Francisco tide gauge readings exceeding 1 m
(b)
Figure 7
But What About Exceedance Frequency?
The GPD fit gives a distribution for how extreme threshold exceedances are when they occur.
But how often do they occur?
```julia
# add column with years of occurrence
dat_decluster = @transform dat_decluster :year = Dates.year.(dat_decluster.datetime)
# group by year and add up occurrences
exceed_counts = combine(groupby(dat_decluster, :year), nrow => :count)
delete!(exceed_counts, nrow(exceed_counts)) # including 2023 will bias the count estimate
p = histogram(exceed_counts.count, legend=:false, xlabel="Yearly Exceedances",
              ylabel="Count", left_margin=5mm, bottom_margin=10mm)
plot!(size=(600, 400))
```
Figure 8: Histogram of the number of exceedances in each year
Poisson - Generalized Pareto Process
Model the number of exceedances in a given year with a Poisson distribution:
\[n \sim \text{Poisson}(\lambda_u).\]
The MLE for \(\lambda_u\) is the mean of the count data, in this case 49.5.
Then, for each \(i=1, \ldots, n\), sample \[X_i \sim \text{GeneralizedPareto}(u, \sigma, \xi).\]
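The two steps above can be sketched with inverse-transform sampling using only the standard library (no Distributions.jl); the parameter values for \(\sigma\) and \(\xi\) below are illustrative placeholders, not fitted values:

```julia
# a minimal sketch of simulating one year of a Poisson-GP process, assuming
# illustrative (not fitted) GPD parameters and λ_u = 49.5 from the count data
using Random

# Knuth's algorithm for a Poisson draw (fine for moderate λ)
function rand_poisson(rng, λ)
    L, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p < L && return k
        k += 1
    end
end

# GPD quantile function: F⁻¹(q) = u + (σ/ξ)((1-q)^(-ξ) - 1) for ξ ≠ 0
gpd_quantile(q, u, σ, ξ) = u + (σ / ξ) * ((1 - q)^(-ξ) - 1)

rng = MersenneTwister(1)
u, σ, ξ, λ_u = 1.0, 0.1, 0.05, 49.5
n = rand_poisson(rng, λ_u)                          # yearly exceedance count
X = [gpd_quantile(rand(rng), u, σ, ξ) for _ in 1:n] # exceedance levels, all > u
```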
Poisson - Generalized Pareto Process Return Levels
Then the return level for return period \(m\) years can be obtained by solving the quantile equation (see Coles (2001) for details).
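One common closed form (following Coles (2001), assuming \(\xi \neq 0\) and an average of \(\lambda_u\) exceedances per year) is

\[x_m = u + \frac{\sigma}{\xi}\left[(\lambda_u m)^{\xi} - 1\right],\]

which follows from setting the expected number of yearly exceedances of \(x_m\), namely \(\lambda_u \left[1 + \xi (x_m - u)/\sigma\right]^{-1/\xi}\), equal to \(1/m\).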
Monday: Mixture Models and Model-Based Clustering (maybe)
Next Wednesday (4/23): No Class
Assessments
HW5 released, due 5/2.
References
Arns, A., Wahl, T., Haigh, I. D., Jensen, J., & Pattiaratchi, C. (2013). Estimating extreme water level probabilities: A comparison of the direct methods and recommendations for best practise. Coast. Eng., 81, 51–66. https://doi.org/10.1016/j.coastaleng.2013.07.003
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag London.
Ferro, C. A. T., & Segers, J. (2003). Inference for clusters of extreme values. J. R. Stat. Soc. Series B Stat. Methodol., 65, 545–556. https://doi.org/10.1111/1467-9868.00401