BAKR: Bayesian Approximate Kernel Regression

BAKR is a software package that provides an effect size analog for each of the input features within Bayesian kernel regression models. Nonlinear kernel regression models are often used in statistics and machine learning due to greater accuracy than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. BAKR uses function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The software is distributed under the GNU General Public License.
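The idea of fitting a nonlinear model in an approximating linear space and then projecting it back onto the inputs can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the BAKR package: the Gaussian kernel is approximated with random Fourier features, a ridge fit stands in for the Bayesian posterior, and the names Z, theta, and beta_tilde are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, D = 200, 5, 300                      # samples, input features, random features
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Random Fourier features approximating a Gaussian (shift-invariant) kernel
# with bandwidth sigma (an assumed choice of kernel).
sigma = 1.0
Omega = rng.normal(scale=1.0 / sigma, size=(p, D))
phase = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ Omega + phase)

# Ridge regression in the feature space (a stand-in for posterior inference).
lam = 1e-2
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
f_hat = Z @ theta                          # fitted nonlinear function values

# Project the fitted function onto the original explanatory variables.
beta_tilde = np.linalg.pinv(X) @ f_hat     # one effect size analog per input feature
```

In this sketch, beta_tilde has one entry per column of X and plays the role of the effect size analog described above.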

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, K.C. Wood, X. Zhou, and S. Mukherjee (2018). Bayesian approximate kernel regression with variable selection. Journal of the American Statistical Association. 113(524): 1710-1721.

Contact:

Please contact Lorin Crawford with any comments or questions.


BANNs: Biologically Annotated Neural Networks

BANNs is a software package that implements a class of probabilistic feedforward Bayesian models with partially connected architectures that are guided by predefined SNP-set annotations. This setup yields a fully interpretable neural network where the input layer encodes SNP-level effects, and the hidden layer models the aggregated effects among SNP-sets. Part of the key innovation in BANNs is to treat the weights and connections of the network as random variables with prior distributions that reflect how genetic effects manifest at different genomic scales. The BANNs software uses scalable variational inference to provide fully interpretable posterior summaries which allow researchers to simultaneously perform (i) fine-mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. The software is distributed under the GNU General Public License.
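A partially connected architecture of this kind can be sketched with a binary mask that zeroes out every connection not supported by the annotations. The code below is a minimal illustration under stated assumptions, not the BANNs implementation: the functions build_mask and forward, the ReLU activation, and the toy annotation are all hypothetical, and the weights are fixed numbers rather than random variables with priors.

```python
import numpy as np

def build_mask(snp_to_set, n_sets):
    """Binary (n_snps x n_sets) mask: entry (j, g) is 1 if SNP j belongs to set g."""
    mask = np.zeros((len(snp_to_set), n_sets))
    for j, g in enumerate(snp_to_set):
        mask[j, g] = 1.0
    return mask

def forward(X, W, b, w_out, mask):
    """Forward pass in which each SNP feeds only the hidden node of its own SNP-set."""
    hidden = np.maximum(X @ (W * mask) + b, 0.0)   # one hidden unit per SNP-set
    return hidden @ w_out

rng = np.random.default_rng(0)
snp_to_set = [0, 0, 1, 1, 1, 2]                    # hypothetical annotation: 6 SNPs, 3 sets
mask = build_mask(snp_to_set, n_sets=3)
X = rng.integers(0, 3, size=(5, 6)).astype(float)  # toy genotypes coded 0/1/2
W = rng.normal(size=(6, 3))                        # input-layer (SNP-level) weights
b = np.zeros(3)
w_out = rng.normal(size=3)                         # hidden-layer (SNP-set-level) weights
y_hat = forward(X, W, b, w_out, mask)
```

Because of the mask, changing a weight between a SNP and a set it is not annotated to has no effect on y_hat, which is what keeps each hidden node interpretable as a SNP-set.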

Download:

We implement BANNs in three different software packages. The first two are implemented in Python using TensorFlow and NumPy, respectively. The third version is implemented in R. All software is currently available on GitHub.

Citations:

P. Demetci*, W. Cheng*, G. Darnell, X. Zhou, S. Ramachandran, and L. Crawford. Multi-scale inference of genetic architecture using biologically annotated neural networks. bioRxiv. 2020.07.02.184465.

Contact:

Please contact Pinar Demetci or Wei Cheng with any comments or questions.


gene-ε: Recalibrated Hypothesis Test for SNP-Level Summary Statistics

gene-ε (pronounced "genie") is software that implements a new empirical Bayesian approach for identifying statistical associations between sets of variants and quantitative traits. The central innovation of gene-ε is reformulating the genome-wide association null model to distinguish between (i) mutations that are statistically associated with the disease but are unlikely to directly influence it, and (ii) mutations that are most strongly associated with a disease of interest. With a reformulated SNP-level null hypothesis, gene-ε presents a powerful framework for enrichment methods and scales well for application to emerging biobank datasets. The software is distributed under the MIT License.

Download:

The software is currently available on GitHub.

Citations:

W. Cheng, S. Ramachandran, and L. Crawford (2020). Estimation of non-null SNP effect size distributions enables the detection of enriched genes underlying complex traits. PLOS Genetics. 16(6): e1008855.

Contact:

Please contact Wei Cheng with any comments or questions.


Grid-LMM: Fast and Flexible Linear Mixed Models for Genetic Association Studies

Grid-LMM is a software package for fitting linear mixed models (LMMs) with multiple random effects. The fitting process is optimized for repeated evaluation of the random effect model with different sets of fixed effects (e.g., for genome-wide association studies or GWAS analyses). The fit is approximate because the random effect variance component proportions are restricted to a discrete grid of possible values. Grid-LMM includes functions for both frequentist and Bayesian GWAS, (restricted) maximum likelihood (REML) evaluation, and Bayesian posterior inference of variance components. The software is distributed under the MIT License.
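The grid idea can be sketched for the simplest case of a single random effect. The code below is an illustration under stated assumptions and not the Grid-LMM API: the functions precompute_grid and best_on_grid are hypothetical, the data are simulated, and the criterion is the profile (maximum) likelihood. The covariance factorization at each grid point is computed once and then reused for every set of fixed effects.

```python
import numpy as np

def precompute_grid(K, grid):
    """Cholesky factor and log-determinant of V = h2*K + (1-h2)*I at each grid point."""
    n = K.shape[0]
    factors = []
    for h2 in grid:
        L = np.linalg.cholesky(h2 * K + (1.0 - h2) * np.eye(n))
        factors.append((L, 2.0 * np.sum(np.log(np.diag(L)))))
    return factors

def best_on_grid(y, X, grid, factors):
    """Return (log-likelihood, h2) at the grid point maximizing the profile likelihood."""
    n = len(y)
    best_ll, best_h2 = -np.inf, None
    for h2, (L, logdet) in zip(grid, factors):
        Xw = np.linalg.solve(L, X)                 # whiten the fixed effects
        yw = np.linalg.solve(L, y)                 # whiten the response
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        r = yw - Xw @ beta
        sigma2 = r @ r / n                         # profiled total variance
        ll = -0.5 * (n * np.log(2.0 * np.pi * sigma2) + logdet + n)
        if ll > best_ll:
            best_ll, best_h2 = ll, h2
    return best_ll, best_h2

rng = np.random.default_rng(2)
n, m = 100, 300
G = rng.normal(size=(n, m))                        # simulated standardized genotypes
K = G @ G.T / m                                    # relatedness matrix
y = G @ rng.normal(scale=np.sqrt(0.5 / m), size=m) + rng.normal(scale=np.sqrt(0.5), size=n)

grid = np.linspace(0.0, 0.95, 20)                  # candidate variance proportions
factors = precompute_grid(K, grid)                 # done once, shared by all markers
results = []
for j in range(3):                                 # test a few markers in turn
    X = np.column_stack([np.ones(n), G[:, j]])
    results.append(best_on_grid(y, X, grid, factors))
```

The cost of the factorizations is paid once per grid point rather than once per marker, which is the source of the speed-up in this sketch.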

Download:

The software is currently available on GitHub.

Citations:

D.E. Runcie# and L. Crawford (2019). Fast and general-purpose linear mixed models for genome-wide genetics. PLOS Genetics. 15(2): e1007978.

Contact:

Please contact Dan Runcie with any comments or questions.


HEBAE: Hierarchical Empirical Bayes Autoencoder

HEBAE is a software package that implements a computationally stable framework for probabilistic and Bayesian generative models. The contributions from HEBAE to the autoencoder literature are two-fold. First, HEBAE makes performance gains by placing a hierarchical prior over the encoding distribution, enabling us to adaptively balance the trade-off between minimizing the reconstruction loss function and avoiding over-regularization. Second, HEBAE assumes a general dependency structure between variables in the latent space which produces better convergence onto the mean-field assumption for improved posterior inference. Overall, HEBAE is more robust to a wide range of hyperparameter initializations than an analogous (and more traditional) variational autoencoder or VAE. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

W. Cheng, G. Darnell, S. Ramachandran, and L. Crawford. Generalizing variational autoencoders with hierarchical empirical Bayes. arXiv. 2007.10389.

Contact:

Please contact Wei Cheng with any comments or questions.


MAPIT and MAPIT-R: MArginal ePIstasis Test

MAPIT is software that implements a new strategy for mapping epistasis: instead of directly identifying individual pairwise or higher-order interactions, MAPIT focuses on mapping variants that have non-zero marginal epistatic effects, that is, the combined pairwise interaction effects between a given variant and all other variants. By testing marginal epistatic effects, MAPIT can identify candidate variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with standard epistatic mapping procedures. MAPIT is based on a variance component model, and relies on a recently developed variance component estimation method for efficient parameter inference and p-value computation. The software is distributed under the GNU General Public License.

Download:

The software package for MAPIT is currently available on GitHub. The software package for MAPIT-R for enrichment of genomic regions and SNP-sets is currently available on CRAN.

Citations:

L. Crawford, P. Zeng, S. Mukherjee, and X. Zhou (2017). Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLOS Genetics. 13(7): e1006869.

M.C. Turchin, G. Darnell, L. Crawford, and S. Ramachandran. Pathway analysis within multiple human ancestries reveals novel signals for epistasis in complex traits. bioRxiv. 2020.09.24.312421.

Contact:

Please contact Lorin Crawford or Michael Turchin with any comments or questions.


MegaLMM: Mega-scale Linear Mixed Models for Multivariate Genomic Prediction

MegaLMM is a software package for fitting multi-trait linear mixed models (MvLMMs) with multiple random effects. There are many notable and unique aspects of MegaLMM relative to other factor models. (1) Residuals of the phenotype after accounting for the factors are not assumed to be iid, but are modeled with independent (across traits) LMMs accounting for both fixed and random effects. (2) The factors themselves are also not assumed to be iid, but are modeled with the same LMMs. This highlights the parallel belief that these latent factors represent traits that we just didn't measure directly. (3) Each factor is shared by all modeled sources of variation (fixed effects, random effects and residuals), rather than being unique to a particular source. (4) The factor loadings are strongly regularized to ensure that estimation is efficient. We accomplish this by ordering the factors from most-to-least important using a prior similar to that proposed by Bhattacharya and Dunson (2011). The software is distributed under the PolyForm Noncommercial License.

Download:

The software is currently available on GitHub.

Citations:

D.E. Runcie, J. Qu, H. Cheng, and L. Crawford. Mega-scale linear mixed models for genomic predictions with thousands of traits. bioRxiv. 2020.05.26.116814.

Contact:

Please contact Dan Runcie with any comments or questions.


RATE: RelATive cEntrality Measures for Variable Prioritization

RATE is a software package that provides a novel measure for assessing input variable importance after having fit a nonlinear or nonparametric (Bayesian) model. By assessing entropy in the joint posterior distribution via Kullback-Leibler divergence (KLD), RATE can correctly prioritize candidate variables which are not just marginally important, but also those whose associations stem from a significant covarying relationship with other variables in the data. RATE is demonstrated in the context of statistical genetics, where the discovery of variants that are involved in nonlinear interactions is of particular interest. The software is distributed under the GNU General Public License.
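A KLD-based importance score of this kind can be sketched for a Gaussian posterior over variable effects. The code below is an illustration under stated assumptions, not the RATE package: it assumes a multivariate normal posterior with mean mu and covariance Sigma, compares the distribution of all other effects with and without conditioning on the j-th effect being zero, and uses the exact Gaussian KLD. The functions gaussian_kl and rate are hypothetical names, and the package may use a different or approximate form.

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate normal distributions."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k + logdet1 - logdet0)

def rate(mu, Sigma):
    """Normalized KLD of each variable: how much setting its effect to zero
    changes the posterior of the remaining effects."""
    p = len(mu)
    kld = np.empty(p)
    for j in range(p):
        rest = np.delete(np.arange(p), j)
        s_jj = Sigma[j, j]
        s_rj = Sigma[rest, j]
        mu_marg = mu[rest]
        S_marg = Sigma[np.ix_(rest, rest)]
        # Conditional distribution of the other effects given effect j equals 0.
        mu_cond = mu_marg - s_rj * mu[j] / s_jj
        S_cond = S_marg - np.outer(s_rj, s_rj) / s_jj
        kld[j] = gaussian_kl(mu_marg, S_marg, mu_cond, S_cond)
    # Note: a diagonal Sigma gives all-zero divergences, so this sum would be 0.
    return kld / kld.sum()

mu = np.array([2.0, 0.0, 0.5])                       # toy posterior mean
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])                  # toy posterior covariance
scores = rate(mu, Sigma)
```

The scores are non-negative and sum to one, so they can be read as relative importance across the candidate variables.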

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, S.R. Flaxman, D.E. Runcie, and M. West (2019). Variable prioritization in nonlinear black box methods: a genetic association case study. Annals of Applied Statistics. 13(2): 958-989.

J. Ish-Horowicz*, D. Udwin*, K. Scharfstein, S.R. Flaxman, L. Crawford, and S.L. Filippi. Interpreting deep neural networks through variable importance. arXiv. 1901.09839.

Contact:

Please contact Lorin Crawford with any comments or questions.


SECT: The Smooth Euler Characteristic Transform

This software package explores the use of a novel statistic, the smooth Euler characteristic transform (SECT), as an automated procedure to extract geometric or topological statistics from tumor images. More generally, the SECT is designed to integrate shape information into regression models by representing shapes and surfaces as a collection of curves. Due to its well-defined inner product structure, the SECT can be used in a wider range of functional and nonparametric modeling approaches than other previously proposed topological summary statistics. We illustrate the utility of the SECT in a radiomics context by showing that the topological quantification of tumors, assayed by magnetic resonance imaging (MRI), is a better predictor of clinical outcomes in patients with glioblastoma multiforme (GBM). Using publicly available data from The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), we show that SECT features alone explain more of the variance in patient survival than gene expression, volumetric features, and morphometric features. The software is distributed under the GNU General Public License.
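Representing a shape as a curve can be sketched for a triangular mesh and a single direction. The code below is an illustration under stated assumptions, not the SECT package: the functions euler_curve and smooth_euler_curve are hypothetical, the shape is a toy tetrahedron surface rather than a tumor image, and the smoothing step (centering the curve and integrating it with the trapezoid rule) is one simple discretization. The full transform would repeat this over many directions.

```python
import numpy as np

def euler_curve(vertices, edges, faces, direction, thresholds):
    """Euler characteristic V - E + F of the part of the mesh at or below each height."""
    h = vertices @ direction                 # height of every vertex along the direction
    e_in = h[edges].max(axis=1)              # an edge enters with its highest vertex
    f_in = h[faces].max(axis=1)              # a face enters with its highest vertex
    return np.array([
        np.sum(h <= t) - np.sum(e_in <= t) + np.sum(f_in <= t)
        for t in thresholds
    ])

def smooth_euler_curve(chi, thresholds):
    """Center the curve at its mean, then take a cumulative trapezoid integral."""
    centered = chi - chi.mean()
    steps = 0.5 * (centered[1:] + centered[:-1]) * np.diff(thresholds)
    return np.concatenate([[0.0], np.cumsum(steps)])

# Toy shape: the surface of a tetrahedron (4 vertices, 6 edges, 4 faces).
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
edges = np.array([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
direction = np.array([0.0, 0.0, 1.0])
thresholds = np.array([-0.5, 0.0, 0.5, 1.0, 1.5])

chi = euler_curve(vertices, edges, faces, direction, thresholds)
sect = smooth_euler_curve(chi, thresholds)
```

Here the Euler characteristic is 0 below the shape, 1 once the bottom triangle has entered, and 2 once the closed surface is complete. The smoothed curve is a continuous function of height, which is what allows it to be used as a functional covariate in a regression model.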

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, A. Monod, A.X. Chen, S. Mukherjee, and R. Rabadán (2020). Predicting clinical outcomes in glioblastoma: an application of topological and functional data analysis. Journal of the American Statistical Association. 115(531): 1139-1150.

Contact:

Please contact Lorin Crawford or Anthea Monod with any comments or questions.


SINATRA: Pipeline for Sub-Image Analysis and Feature Selection on 3D Shapes

The sub-image selection problem is to identify physical regions that most explain the variation between two classes of three-dimensional shapes. SINATRA is a software package that implements a statistical pipeline for carrying out sub-image analyses using topological summary statistics. The algorithm follows four key steps: (1) 3D shapes (represented as triangular meshes) are summarized by a collection of vectors (or curves) detailing their topology (e.g. Euler characteristics, persistence diagrams). (2) A statistical model is used to classify the shapes based on their topological summaries. Here, we make use of a Gaussian process classification model with a probit link function. (3) After fitting the model, an association measure is computed for each topological feature (e.g. centrality measures, posterior inclusion probabilities, p-values, etc). (4) Association measures are mapped back onto the original shapes via a reconstruction algorithm, thus highlighting evidence of the physical (spatial) locations that best explain the variation between the two groups. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

B. Wang*, T. Sudijono*, H. Kirveslahti*, T. Gao, D.M. Boyer, S. Mukherjee#, and L. Crawford. A statistical pipeline for identifying physical features that differentiate classes of 3D shapes. Annals of Applied Statistics. In Press.

Contact:

Please contact Bruce Wang or Timothy Sudijono with any comments or questions.


Tropix: Tropical Sufficient Statistics for Persistent Homology

Tropix is a software package that uses an embedding in Euclidean space based on tropical geometry to generate stable sufficient statistics for barcodes --- multiscale summaries of topological characteristics that capture the “shape” of data, but have complex structures and are therefore difficult to use in statistical settings. This statistical sufficiency result allows for the assumption of classical probability distributions on Euclidean representations of barcodes. This in turn makes a variety of parametric inference methods amenable to barcodes, all while maintaining their initial interpretations. In particular, this work shows that exponential family distributions may be assumed, and that likelihoods for persistent homology may be constructed. In the citation below, we use Tropix to conceptually demonstrate sufficiency and illustrate its utility in persistent homology dimensions 0 and 1 with concrete parametric applications to HIV and avian influenza data. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

A. Monod, S. Kališnik Verovšek, J.Á. Patiño-Galindo, and L. Crawford (2019). Tropical sufficient statistics for persistent homology. SIAM Journal on Applied Algebra and Geometry. 3(2): 337-371.

Contact:

Please contact Lorin Crawford or Anthea Monod with any comments or questions.