UCSC-SOE-23-02: Scalable Model Selection with Mixtures of g-Priors in Large Data Settings

Jacob Fontana and Bruno Sansó
March 6, 2023
We consider the variable selection problem for linear models using a mixture of g-priors. While a commonly studied aspect of these models is their posterior consistency in the case where the true model lies within the model space under consideration ($\mathcal{M}$-closed), we examine the $\mathcal{M}$-open case, where the data generating process is not an element of the model space. Two distinctive problems arise in this setting when examining large data sets (on the order of $10^6$ observations): shrinkage-deficient estimation (SDE) and model superinduction (MS). SDE refers to the phenomenon where the posterior shrinkage decays at a linear rate with the sample size. MS refers to the tendency of model selection procedures to select larger models as the sample size grows. We prove that when comparing nested models using Bayes factors, for a sufficiently large sample size the larger model will be selected. In many cases this behavior results in overparameterized models that induce severe computational difficulties. We show that this phenomenon is inescapable (affecting even oracle estimators), so we instead seek to minimize the severity of its effects at large sample sizes while preserving posterior consistency. To that end, we propose a beta-prime hyper-prior on g, with hyper-parameters chosen to yield a sub-linear decay of the posterior shrinkage. We also propose a model space prior which asymptotically biases the posterior odds ratio towards smaller models. These two priors introduce two new hyper-parameters, for which we propose default values. We demonstrate the aforementioned phenomena, and the efficacy of our proposed solutions, via several synthetic data examples, as well as a case study using albedo data from GOES satellites.
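To fix notation, the construction discussed above can be sketched as follows. The g-prior and beta-prime forms shown are the standard ones; how the hyper-parameters $a$ and $b$ should scale with $n$ to achieve sub-linear shrinkage decay is the paper's contribution, and the dependence indicated in the comment is only illustrative.

```latex
% Zellner's g-prior on the coefficients of a candidate model M_k:
\[
  \boldsymbol{\beta}_k \mid g, \sigma^2
    \sim \mathrm{N}\!\left(\mathbf{0},\; g\,\sigma^2
      (\mathbf{X}_k^{\top}\mathbf{X}_k)^{-1}\right),
\]
% under which the posterior mean of beta_k is shrunk toward zero by the
% factor g/(1+g). A beta-prime hyper-prior on g has density
\[
  p(g) = \frac{g^{\,a-1}(1+g)^{-(a+b)}}{\mathrm{B}(a,b)},
  \qquad g > 0,\; a, b > 0,
\]
% where B(a,b) is the beta function. Allowing a or b to depend on the
% sample size n controls the rate at which the posterior shrinkage
% factor g/(1+g) approaches 1 as n grows.
```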