Authors:
(1) Mohamed A. Abba, Department of Statistics, North Carolina State University;
(2) Brian J. Reich, Department of Statistics, North Carolina State University;
(3) Reetam Majumder, Southeast Climate Adaptation Science Center, North Carolina State University;
(4) Brandon Feng, Department of Statistics, North Carolina State University.
Table of Links
1.1 Methods to handle large spatial datasets
1.2 Review of stochastic gradient methods
2 Matérn Gaussian Process Model and its Approximations
3 The SG-MCMC Algorithm and 3.1 SG Langevin Dynamics
3.2 Derivation of gradients and Fisher information for SGRLD
4 Simulation Study and 4.1 Data generation
4.2 Competing methods and metrics
5 Analysis of Global Ocean Temperature Data
6 Discussion, Acknowledgements, and References
Appendix A.1: Computational Details
Appendix A.2: Additional Results
1.1 Methods to handle large spatial datasets
The main computational bottleneck in GP regression is inverting the dense n × n covariance matrix, which requires O(n^3) operations. To overcome this problem, a large body of methods has been proposed over the last few decades, including but not limited to low-rank approximations, covariance tapering, divide-and-conquer strategies, and Vecchia-type methods. Although these approaches differ substantially, they all impose a computationally amenable structure on the covariance matrix or its inverse.
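To make the bottleneck concrete, the display below gives the Gaussian log-likelihood for n observations y with mean μ and covariance Σ(θ) (the notation here is generic, introduced only for illustration, and may differ from the paper's own):

\[
\log L(\theta) = -\frac{1}{2}\log\lvert\Sigma(\theta)\rvert \;-\; \frac{1}{2}\,(\mathbf{y}-\boldsymbol{\mu})^\top \Sigma(\theta)^{-1} (\mathbf{y}-\boldsymbol{\mu}) \;-\; \frac{n}{2}\log(2\pi).
\]

Evaluating log|Σ(θ)| and Σ(θ)^{-1} via a dense Cholesky factorization costs O(n^3) time and O(n^2) memory, which is precisely what each family of methods below seeks to avoid.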
A low-rank approximation of the GP can be used to avoid the cost of inverting the full covariance matrix (e.g., Cressie and Johannesson, 2008; Katzfuss and Cressie, 2011; Kang et al., 2009). Low-rank approximations project the spatial process onto a low-dimensional space and use the low-rank representation as a surrogate for the original process. Banerjee et al. (2008) introduced the predictive process, which places a set of knots in the spatial domain and then conditions the expectation of the original process on the process values at those knots. Fixed rank kriging (Cressie and Johannesson, 2008) approximates the original process with a small number of basis functions, so the precision matrix can be obtained by inverting a matrix of much smaller dimension.
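To sketch why low-rank structure helps (this is a standard linear-algebra identity, not the derivation of any one method above): suppose the covariance is approximated as Σ ≈ U C Uᵀ + D, where U is an n × m matrix of basis-function evaluations with m ≪ n, C is an m × m covariance, and D is diagonal. The Sherman-Morrison-Woodbury identity and the matrix determinant lemma then give

\[
\Sigma^{-1} \approx D^{-1} - D^{-1} U \left( C^{-1} + U^\top D^{-1} U \right)^{-1} U^\top D^{-1},
\qquad
\lvert\Sigma\rvert \approx \lvert C^{-1} + U^\top D^{-1} U \rvert \,\lvert C \rvert\, \lvert D \rvert,
\]

so only m × m matrices need to be factorized and the overall cost drops to O(n m^2).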
Instead of approximating the original process, one can impose a fixed structure on the covariance or precision matrix directly. The first approach, known as covariance tapering (Furrer et al., 2006; Kaufman et al., 2008), imposes compact support on the correlation function, so the correlation between a site and its distant neighbours is set exactly to zero. This induces sparsity in the covariance matrix that can be exploited to speed up computation. Rather than constraining the covariance, Rue et al. (2009) directly impose a sparse structure on the precision matrix using a Gaussian Markov random field approximation to the true process.
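Concretely, tapering replaces the correlation function C(h), viewed as a function of the distance h, by the product

\[
C_{\mathrm{tap}}(h) \;=\; C(h)\, T_{\gamma}(h),
\]

where T_γ is a compactly supported correlation function (e.g., a Wendland taper) that equals zero whenever h > γ; the symbol γ for the taper range is our notation for illustration. The tapered covariance matrix is therefore sparse, and sparse Cholesky routines replace dense ones.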
Divide-and-conquer approaches have also been proposed to scale GP inference. Barbian and Assunção (2017) and Guhaniyogi and Banerjee (2018) propose splitting the spatial domain into subsets, performing the analysis in parallel on each subset, and then combining the results, which distributes the workload across many smaller, independent computations. Another option is to divide the spatial region into independent sub-regions and analyze the whole dataset under this working assumption (Sang et al., 2011). Unlike the former, the latter uses the whole dataset jointly but reduces the computational cost by exploiting independence between subregions.
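In schematic form (a sketch in our notation, not the full construction of Sang et al. (2011)), the independence assumption means the likelihood factorizes across sub-regions B_1, …, B_K,

\[
p(\mathbf{y}) \;\approx\; \prod_{k=1}^{K} p\!\left(\mathbf{y}_{B_k}\right),
\]

so the covariance matrix is treated as block-diagonal, the O(n^3) cost is replaced by the sum of the per-block cubic costs, and each block can be processed in parallel.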
One of the earliest and most influential methods for scalable GPs is the Vecchia approximation (Vecchia, 1988; Stein et al., 2004). In the Vecchia framework, the full likelihood is factorized into a product of conditional distributions, and the factorization is then simplified by reducing each conditioning set to a small number of neighbours, which in turn yields a sparse precision matrix. Guinness (2018) showed that Vecchia's method accurately approximates the true Gaussian process model in terms of the Kullback-Leibler divergence. The approach is also well suited to parallel computing because the likelihood factorizes. Recent work has built upon and extended the Vecchia approximation. Katzfuss and Guinness (2021) propose a generalization of Vecchia's framework and show that many existing Gaussian process approximations are special cases of the extended method. Datta et al. (2016) proposed the nearest-neighbor Gaussian process as an extension of the Vecchia approximation; Finley et al. (2019) later outlined an efficient Markov chain Monte Carlo (MCMC) algorithm for scalable fully Bayesian inference with this method.
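In symbols (again our generic notation), Vecchia's approximation starts from the exact chain-rule factorization of the joint density under some ordering of the observations and shrinks each conditioning set to a small neighbour set N(i) ⊂ {1, …, i−1} of size at most m:

\[
p(\mathbf{y}) \;=\; \prod_{i=1}^{n} p\!\left(y_i \mid y_1, \dots, y_{i-1}\right)
\;\approx\; \prod_{i=1}^{n} p\!\left(y_i \mid \mathbf{y}_{N(i)}\right).
\]

Each factor is a univariate Gaussian whose mean and variance involve only an m × m covariance, so one likelihood evaluation costs O(n m^3), and the implied joint distribution has a sparse precision matrix.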
In general, all of the aforementioned methods reduce the computational cost from cubic to linear in the number of observations. However, in the Bayesian framework we are mostly interested in posterior sampling through MCMC methods, in order to obtain uncertainty estimates for the model parameters as well as predictive credible intervals at locations of interest. MCMC methods typically require thousands of iterations to accurately approximate the posterior distribution, so even when the cost per iteration is linear, the total time can still be prohibitive. Recent work has therefore also focused on subsampling approaches for spatial data that reduce the computational cost of posterior sampling. Saha and Bradley (2023) developed an efficient composite sampling scheme for posterior inference. Similarly, Heaton and Johnson (2023) use minibatches to approximate the complete conditional distribution of conjugate parameters and provide approximate Metropolis-Hastings (MH) acceptance probabilities for non-conjugate parameters. While Heaton and Johnson (2023) use a Vecchia approximation to define the minibatches, neither of these works uses gradient information when drawing samples from the posterior, and both are therefore fundamentally different from the gradient-based approach we employ in this study.
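For context on what "gradient-based" means here, the canonical stochastic gradient Langevin dynamics (SGLD) update of Welling and Teh (2011), written for n conditionally independent observations and a random minibatch B_t, is a useful point of reference (a textbook sketch, not the algorithm developed in this paper):

\[
\theta_{t+1} \;=\; \theta_t \;+\; \frac{\epsilon_t}{2}\left( \nabla \log p(\theta_t) \;+\; \frac{n}{\lvert B_t \rvert} \sum_{i \in B_t} \nabla \log p\!\left(y_i \mid \theta_t\right) \right) \;+\; \eta_t,
\qquad \eta_t \sim \mathrm{N}(0, \epsilon_t I).
\]

Spatially correlated data break the independence that makes the minibatch gradient unbiased; handling this is the focus of the SG-MCMC algorithm developed in Section 3.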
This paper is available on arXiv under a CC BY 4.0 DEED license.