α-Shape Sampler: Pipeline for Generating 2D and 3D Biological Shapes and Images

Understanding morphological variation is an important task in many applications. Recent studies in computational biology have focused on developing computational tools for the task of sub-image selection which aims at identifying structural features that best describe the variation between classes of shapes. A major part in assessing the utility of these approaches is to demonstrate their performance on both simulated and real datasets. However, when creating a model for shape statistics, real data can be difficult to access and the sample sizes for these data are often small due to them being expensive to collect. Meanwhile, the landscape of current shape simulation methods has been mostly limited to approaches that use black-box inference---making it difficult to systematically assess the power and calibration of sub-image models. The α-shape sampler is an R package for generating synthetic alpha shapes by either (i) empirical sampling based on an existing dataset with reference shapes, or (ii) probabilistic sampling from a known distribution function on shapes. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

E.T. Winn-Nuñez, H. Witt, D. Bhaskar, R.Y. Huang, J.S. Reichner, I.Y. Wong, and L. Crawford. Generative modeling of biological shapes and images using a probabilistic α-shape sampler. bioRxiv. 2023.10.10.561790.

Contact:

Please contact Emily Winn-Nuñez or Lorin Crawford with any comments or questions.


BAKR: Bayesian Approximate Kernel Regression

BAKR is a software package that provides an effect size analog for each of the input features within Bayesian kernel regression models. Nonlinear kernel regression models are often used in statistics and machine learning due to greater accuracy than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. BAKR uses function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The software is distributed under the GNU General Public License.
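
As a rough illustration of how a shift-invariant kernel can be replaced by a linear vector space, the sketch below uses random Fourier features for a Gaussian kernel. It is a minimal example under stated assumptions, not the BAKR implementation: the function names, the bandwidth, and the number of features are all hypothetical, and the projection back onto the original explanatory variables is not shown.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Exact shift-invariant Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def make_random_fourier_map(dim, n_features, bandwidth, seed=0):
    """Return a map z such that z(x) . z(y) approximates the Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies follow the kernel's spectral density, N(0, 1 / bandwidth^2).
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w * v for w, v in zip(freq, x)) + phase)
                for freq, phase in zip(freqs, phases)]

    return feature_map


# Hypothetical inputs chosen only for illustration.
x = [0.2, -0.5, 1.0]
y = [0.0, 0.3, 0.7]
z = make_random_fourier_map(dim=3, n_features=5000, bandwidth=1.0, seed=1)
approx = sum(a * b for a, b in zip(z(x), z(y)))
exact = gaussian_kernel(x, y, bandwidth=1.0)
print(f"exact kernel = {exact:.3f}, random-feature approximation = {approx:.3f}")
```

Because the features enter any downstream model linearly, coefficients fit in this space are what a method like BAKR could then map back to the input variables.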

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, K.C. Wood, X. Zhou, and S. Mukherjee (2018). Bayesian approximate kernel regression with variable selection. Journal of the American Statistical Association. 113(524): 1710-1721.

Contact:

Please contact Lorin Crawford with any comments or questions.


BANNs: Biologically Annotated Neural Networks

BANNs is a software package that implements a class of probabilistic feedforward Bayesian models with partially connected architectures that are guided by predefined SNP-set annotations. This setup yields a fully interpretable neural network where the input layer encodes SNP-level effects, and the hidden layer models the aggregated effects among SNP-sets. Part of the key innovation in BANNs is to treat the weights and connections of the network as random variables with prior distributions that reflect how genetic effects manifest at different genomic scales. The BANNs software uses scalable variational inference to provide fully interpretable posterior summaries which allow researchers to simultaneously perform (i) fine-mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. The software is distributed under the GNU General Public License.

Download:

We implement BANNs in three different software packages. The first two are implemented in Python using TensorFlow and NumPy, respectively. The third version is implemented in R. All software is currently available on GitHub.

Citations:

P. Demetci*, W. Cheng*, G. Darnell, X. Zhou, S. Ramachandran, and L. Crawford (2021). Multi-scale inference of genetic architecture using biologically annotated neural networks. PLOS Genetics. 17(8): e1009754.

Contact:

Please contact Pinar Demetci or Wei Cheng with any comments or questions.


callback: Calibrated Clustering via Knockoffs

Standard single-cell RNA-sequencing (scRNA-seq) pipelines nearly always include unsupervised clustering as a key step in identifying biologically distinct cell types. A follow-up step in these pipelines is to test for differential expression between the identified clusters. When algorithms over-cluster, downstream analyses will produce inflated P-values resulting in increased false discoveries. In this software package, we present callback (Calibrated Clustering via Knockoffs): a new method for protecting against over-clustering by controlling for the impact of reusing the same data twice when performing differential expression analysis, commonly known as “double-dipping”. Importantly, our approach can be applied to a wide range of clustering algorithms. Using real and simulated data, we show that callback provides state-of-the-art clustering performance and can rapidly analyze large-scale scRNA-seq studies, even on a personal laptop. The software is distributed under the MIT License.

Download:

An open-source software implementation of callback is available on GitHub. Package documentation including examples and articles can be found on GitHub Pages.

Citations:

A. DenAdel, M.L. Ramseier, A. Navia, A.K. Shalek, S. Raghavan, P.S. Winter, A.P. Amini, and L. Crawford. A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing. bioRxiv. 2024.03.08.584180.

Contact:

Please contact Alan DenAdel or Lorin Crawford with any comments or questions.


ESNN: Ensemble of Single-Effect Neural Networks

ESNN is a software package that implements the “ensemble of single-effect neural networks” framework which generalizes the “sum of single-effects” regression framework by both accounting for nonlinear structure in genotypic data (e.g., dominance effects) and having the capability to model discrete phenotypes (e.g., case-control studies). The ESNN model uses scalable variational inference with an assumed grouped “single-effect” shrinkage prior on the input weights of neural networks which allows it to produce posterior inclusion probabilities and credible sets that can guide variable selection. While motivated by fine-mapping in genome-wide association (GWA) studies, this method is also applicable to other fields especially when data are correlated and sparse. The software is distributed under the MIT License.
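
To make the notion of a credible set concrete, the following minimal sketch builds one from the posterior inclusion probabilities (PIPs) of a single effect. It assumes the PIPs for that effect sum to one, as in "sum of single-effects" style models; the function name and the toy values are hypothetical and are not taken from the ESNN code.

```python
def credible_set(pips, coverage=0.95):
    """Smallest set of variable indices whose single-effect PIPs sum to >= coverage."""
    # Visit variables from most to least probable.
    order = sorted(range(len(pips)), key=lambda j: pips[j], reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += pips[j]
        if total >= coverage:
            break
    return chosen


# Toy PIPs for one effect across four correlated variables.
pips = [0.125, 0.5, 0.125, 0.25]
print(credible_set(pips, coverage=0.75))  # indices of the two strongest variables
```

A small credible set points to a well-localized signal, while a large one reflects uncertainty among correlated variables.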

Download:

Source code and tutorials for implementing the “ensemble of single-effect neural networks” (ESNN) framework are publicly available on GitHub.

Citations:

W. Cheng, S. Ramachandran, and L. Crawford (2022). Uncertainty quantification in variable selection for genetic fine-mapping using Bayesian neural networks. iScience. 25(7): 104553.

Contact:

Please contact Wei Cheng with any comments or questions.


gene-ε: Recalibrated Hypothesis Test for SNP-Level Summary Statistics

gene-ε (pronounced "genie") is software that implements a new empirical Bayesian approach for identifying statistical associations between sets of variants and quantitative traits. The central innovation of gene-ε is reformulating the genome-wide association null model to distinguish between (i) mutations that are statistically associated with the disease but are unlikely to directly influence it, and (ii) mutations that are most strongly associated with a disease of interest. With a reformulated SNP-level null hypothesis, gene-ε presents a powerful framework for enrichment methods and scales well for application to emerging biobank datasets. The software is distributed under the MIT License.

Download:

The software is currently available on GitHub.

Citations:

W. Cheng, S. Ramachandran, and L. Crawford (2020). Estimation of non-null SNP effect size distributions enables the detection of enriched genes underlying complex traits. PLOS Genetics. 16(6): e1008855.

Contact:

Please contact Wei Cheng with any comments or questions.


GOALS: The GlObal And Local Score

GOALS is a software package that provides a simple post hoc approach to simultaneously assess local and global variable importance in nonlinear models. The ability to interpret machine learning models has become increasingly important as their usage in data science continues to rise. Most current interpretability methods are optimized to work on either (i) a global scale, where the goal is to rank features based on their contributions to overall variation in an observed population, or (ii) the local level, which aims to detail how important a feature is to a particular individual in the dataset. Motivated by problems in statistical genetics, we demonstrate GOALS using Gaussian process regression where understanding how genetic markers affect trait architecture both among individuals and across populations is of high interest. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

E.T. Winn-Nuñez, M. Griffin, and L. Crawford (2024). A simple approach for local and global variable importance in nonlinear regression models. Computational Statistics & Data Analysis. 194: 107914.

Contact:

Please contact Emily Winn-Nuñez or Lorin Crawford with any comments or questions.


Grid-LMM: Fast and Flexible Linear Mixed Models for Genetic Association Studies

Grid-LMM is a software package for fitting linear mixed models (LMMs) with multiple random effects. The fitting process is optimized for repeated evaluation of the random effect model with different sets of fixed effects (e.g., for genome-wide association studies, or GWAS). The fit is approximate because the random effect variance component proportions are restricted to a discrete grid of possible values. Grid-LMM includes functions for both frequentist and Bayesian GWAS, (restricted) maximum likelihood (REML) evaluation, and Bayesian posterior inference of variance components. The software is distributed under the MIT License.
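
The grid idea can be sketched in a few lines. The example below is a simplified illustration and not the Grid-LMM code: it assumes a single random effect, no fixed effects, and data that have already been rotated by the eigenvectors of the relatedness matrix K, so that the covariance is diagonal. The function names and toy values are hypothetical.

```python
import math


def profile_loglik(h2, rotated_y, eigenvalues):
    """Profile log-likelihood of y ~ N(0, s2 * (h2 * K + (1 - h2) * I)) after rotation.

    rotated_y holds U^T y and eigenvalues holds the spectrum of K = U diag(d) U^T,
    so the covariance is diagonal with entries s2 * (h2 * d_i + 1 - h2).
    """
    n = len(rotated_y)
    weights = [h2 * d + (1.0 - h2) for d in eigenvalues]
    # The overall scale s2 has a closed-form maximizer for each fixed h2.
    s2_hat = sum(v * v / w for v, w in zip(rotated_y, weights)) / n
    return -0.5 * (n * math.log(2.0 * math.pi * s2_hat)
                   + sum(math.log(w) for w in weights) + n)


def grid_search_h2(rotated_y, eigenvalues, grid):
    """Evaluate every grid point and return the proportion with the highest likelihood."""
    return max(grid, key=lambda h2: profile_loglik(h2, rotated_y, eigenvalues))


eigenvalues = [0.1, 0.5, 1.0, 2.0, 4.0]
# Toy data whose squared values match the variances implied by h2 = 0.5.
rotated_y = [math.sqrt(0.5 * d + 0.5) for d in eigenvalues]
grid = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9
print(grid_search_h2(rotated_y, eigenvalues, grid))
```

Because the per-grid-point quantities (here, the weights) do not depend on which fixed effects are tested, they can be computed once and reused across many tests, which is the kind of saving the grid approach is designed for.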

Download:

The software is currently available on GitHub.

Citations:

D.E. Runcie and L. Crawford (2019). Fast and general-purpose linear mixed models for genome-wide genetics. PLOS Genetics. 15(2): e1007978.

Contact:

Please contact Dan Runcie with any comments or questions.


HEBAE: Hierarchical Empirical Bayes Autoencoder

HEBAE is a software package that implements a computationally stable framework for probabilistic and Bayesian generative models. The contributions from HEBAE to the autoencoder literature are two-fold. First, HEBAE makes performance gains by placing a hierarchical prior over the encoding distribution, enabling us to adaptively balance the trade-off between minimizing the reconstruction loss function and avoiding over-regularization. Second, HEBAE assumes a general dependency structure between variables in the latent space which produces better convergence onto the mean-field assumption for improved posterior inference. Overall, HEBAE is more robust to a wide range of hyperparameter initializations than an analogous (and more traditional) variational autoencoder or VAE. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

W. Cheng, G. Darnell, S. Ramachandran, and L. Crawford. Generalizing variational autoencoders with hierarchical empirical Bayes. arXiv. 2007.10389.

Contact:

Please contact Wei Cheng with any comments or questions.


i-LDSC: Interaction-LD Score Regression

i-LDSC is the software implementing interaction-LD score regression. LD score regression (LDSC) is a method to estimate narrow-sense heritability from genome-wide association study (GWAS) summary statistics alone, making it a fast and popular approach. The key concept underlying the LDSC framework is that there is a positive linear relationship between the magnitude of GWAS allelic effect estimates and linkage disequilibrium (LD) when complex traits are generated under the infinitesimal model — that is, causal variants are uniformly distributed along the genome and each have the same expected contribution to phenotypic variation. We present interaction-LD score (i-LDSC) regression: an extension of the original LDSC framework that accounts for non-additive genetic effects. By studying a wide range of generative models in simulations, and by re-analyzing 25 well-studied quantitative phenotypes from 349,468 individuals in the UK Biobank and up to 159,095 individuals in BioBank Japan, we show that the inclusion of a cis-interaction score (i.e., interactions between a focal variant and nearby variants) significantly recovers substantial non-additive heritability that is not captured by LDSC. For each of the 25 traits analyzed in the UK Biobank and 23 of the 25 traits analyzed in BioBank Japan, i-LDSC detects a significant amount of variation contributed by genetic interactions. The i-LDSC software is distributed under the GNU General Public License.
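
The linear relationship behind the original LDSC framework can be illustrated with a small regression. The sketch below assumes the standard form in which the expected chi-square statistic of variant j is 1 + (N h2 / M) * l_j, with N the sample size, M the number of variants, and l_j the LD score. It uses an unweighted least squares fit on noise-free toy data; the function names are hypothetical, and neither the weighting used in practice nor the additional cis-interaction score regressor of i-LDSC is shown.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ldsc_heritability(chi_sq, ld_scores, n_samples, n_variants):
    """Convert the LD score regression slope (N * h2 / M) into an h2 estimate."""
    slope, intercept = ols_slope_intercept(ld_scores, chi_sq)
    return slope * n_variants / n_samples, intercept


# Noise-free toy statistics generated with h2 = 0.4, N = 10,000 and M = 5 variants.
ld_scores = [1.0, 2.0, 4.0, 8.0, 16.0]
n_samples, n_variants, true_h2 = 10_000, 5, 0.4
chi_sq = [1.0 + n_samples * true_h2 / n_variants * l for l in ld_scores]
h2_hat, intercept = ldsc_heritability(chi_sq, ld_scores, n_samples, n_variants)
print(round(h2_hat, 6), round(intercept, 6))
```

Extending the fit with a second per-variant score as an additional regressor is the general shape of the i-LDSC extension described above.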

Download:

The software package for i-LDSC is currently available on GitHub.

Citations:

S.P. Smith*, G. Darnell*, D. Udwin, A. Harpak, S. Ramachandran, and L. Crawford. Accounting for statistical non-additive interactions enables the recovery of missing heritability from GWAS summary statistics. bioRxiv. 2022.07.21.501001.

Contact:

Please contact Lorin Crawford with any comments or questions.


MAPIT: The MArginal ePIstasis Test

MAPIT is the software implementing the new strategy for mapping epistasis: instead of directly identifying individual pairwise or higher-order interactions, MAPIT focuses on mapping variants that have non-zero marginal epistatic effects --- the combined pairwise interaction effects between a given variant and all other variants. By testing marginal epistatic effects, MAPIT can identify candidate variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with standard epistatic mapping procedures. MAPIT is based on a variance component model, and relies on a recently developed variance component estimation method for efficient parameter inference and p-value computation. The software is distributed under the GNU General Public License.
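
To illustrate what "the combined pairwise interaction effects between a given variant and all other variants" looks like in a variance component model, the sketch below builds a covariance matrix for the marginal epistatic effect of variant k. It assumes the form G_k = D_k K_k D_k, where D_k = diag(x_k) and K_k is the genetic relatedness matrix computed from all variants except k. This is an illustrative reading of the model with hypothetical names, not the package's code, and it skips the usual centering and scaling of genotypes.

```python
def marginal_epistasis_kernel(genotypes, k):
    """Covariance G_k = D_k K_k D_k for interactions between variant k and all others.

    genotypes is an n-by-p list of rows (individuals by variants);
    D_k = diag(x_k) and K_k is the relatedness matrix built without variant k.
    """
    n, p = len(genotypes), len(genotypes[0])
    others = [[row[l] for l in range(p) if l != k] for row in genotypes]
    x_k = [row[k] for row in genotypes]
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            relatedness = sum(a * b for a, b in zip(others[i], others[j])) / (p - 1)
            kernel[i][j] = x_k[i] * relatedness * x_k[j]
    return kernel


# Toy genotype matrix: two individuals, three variants.
genotypes = [[1, 0, 2],
             [0, 1, 1]]
print(marginal_epistasis_kernel(genotypes, k=0))
print(marginal_epistasis_kernel(genotypes, k=2))
```

Testing whether the variance component attached to this matrix is non-zero is what allows a variant to be flagged without enumerating its interaction partners.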

Download:

The software package for MAPIT is currently available on GitHub. The software package for MAPIT-R for enrichment of genomic regions and SNP-sets is currently available on CRAN.

Citations:

L. Crawford, P. Zeng, S. Mukherjee, and X. Zhou (2017). Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLOS Genetics. 13(7): e1006869.

M.C. Turchin, G. Darnell, L. Crawford, and S. Ramachandran. Pathway analysis within multiple human ancestries reveals novel signals for epistasis in complex traits. bioRxiv. 2020.09.24.312421.

Contact:

Please contact Lorin Crawford with any comments or questions.


MegaLMM: Mega-scale Linear Mixed Models for Multivariate Genomic Prediction

MegaLMM is a software package for fitting multi-trait linear mixed models (MvLMMs) with multiple random effects. There are many notable and unique aspects of MegaLMM relative to other factor models. (1) Residuals of the phenotype after accounting for the factors are not assumed to be iid, but are modeled with independent (across traits) LMMs accounting for both fixed and random effects. (2) The factors themselves are also not assumed to be iid, but are modeled with the same LMMs. This highlights the parallel belief that these latent factors represent traits that we just didn't measure directly. (3) Each factor is shared by all modeled sources of variation (fixed effects, random effects and residuals), rather than being unique to a particular source. (4) The factor loadings are strongly regularized to ensure that estimation is efficient. We accomplish this by ordering the factors from most-to-least important using a prior similar to that proposed by Bhattacharya and Dunson (2011). The software is distributed under the PolyForm Noncommercial License.

Download:

The software is currently available on GitHub.

Citations:

D.E. Runcie, J. Qu, H. Cheng, and L. Crawford (2021). Mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biology. 22: 213.

Contact:

Please contact Dan Runcie with any comments or questions.


Multioviz: Platform for Analyzing Gene Regulatory Networks

Multioviz is a web-based tool and R package for in silico exploration and assessment of gene regulatory networks (GRNs). While many GRN platforms have been developed, a majority do not allow for perturbation analyses where a user is able to impose modifications onto a network (i.e., the addition or subtraction of a node or edge) and invoke a statistical reanalysis to learn how a phenotype might change with new sets of molecular interactions. The key contribution of Multioviz is that it enables in silico perturbation experiments within an easy-to-use interface that includes the following three main features. First, it allows users to couple summary statistics from a computational analysis (e.g., p-values or PIPs) along with a set of biological annotations (e.g., SNPs within the boundary of a gene) to visualize multi-level genomic relationships in the form of a GRN. Second, it allows users to perturb these learned networks and investigate the associated ramifications on a phenotype of interest. Lastly, Multioviz integrates various variable selection methods to give users a wide choice of statistical approaches that they can use to generate relevant multi-level genomic signatures for their analyses. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub. An online platform tool is also available and hosted by Brown University.

Citations:

H. Xie, L. Crawford, and A. Conard. Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks. bioRxiv. 2023.10.10.561790.

Contact:

Please contact Ashley Conard or Lorin Crawford with any comments or questions.


mvMAPIT: The Multivariate MArginal ePIstasis Test

mvMAPIT is the software implementing the multi-outcome extension of the statistical framework MAPIT which aims to identify variants that are involved in epistatic interactions by leveraging the correlation structure of non-additive genetic variation that is shared between multiple traits. By searching for marginal epistatic effects, one can identify genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact – thus, potentially alleviating much of the statistical and computational burden associated with conventional explicit search based methods. Our proposed mvMAPIT builds upon this strategy by leveraging correlation structures between traits to improve the identification of variants involved in epistasis. We formulate mvMAPIT as a multivariate linear mixed model and develop a multi-trait variance component estimation algorithm for efficient parameter inference and p-value computation. Together with reasonable model approximations, our proposed approach is scalable to moderately sized genome-wide association (GWA) studies. The software is distributed under the GNU General Public License.

Download:

The software package for mvMAPIT is currently available on GitHub. Package documentation including examples and articles can be found on GitHub Pages.

Citations:

L. Crawford, P. Zeng, S. Mukherjee, and X. Zhou (2017). Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLOS Genetics. 13(7): e1006869.

J. Stamp, A. DenAdel, D. Weinreich, and L. Crawford. Leveraging the genetic correlation between traits improves the detection of epistasis in genome-wide association studies. bioRxiv. 2022.11.30.518547.

Contact:

Please contact Julian Stamp or Lorin Crawford with any comments or questions.


NCLUSION: Nonparametric Clustering of Single-cell Populations

NCLUSION is the software implementing an infinite mixture model that leverages Bayesian sparse priors to identify marker genes while simultaneously performing clustering on single-cell expression data. Clustering is commonly used in single-cell RNA-sequencing (scRNA-seq) pipelines to characterize cellular heterogeneity. However, current methods face two main limitations. First, they require user-specified heuristics which add time and complexity to bioinformatic workflows; second, they rely on post-selective differential expression analyses to identify marker genes driving cluster differences, which has been shown to be subject to inflated false discovery rates. We address these challenges by introducing "nonparametric clustering of single-cell populations" or NCLUSION. NCLUSION uses a scalable variational inference algorithm to perform these analyses on datasets with up to millions of cells. NCLUSION (i) matches the performance of other state-of-the-art clustering techniques with significantly reduced runtime and (ii) provides statistically robust and biologically relevant transcriptomic signatures for each of the clusters it identifies. Overall, NCLUSION represents a reliable hypothesis-generating tool for understanding patterns of expression variation present in single-cell populations. The software is distributed under the MIT License.

Download:

An open-source software implementation of NCLUSION is available on GitHub. Package documentation including examples and articles can be found on GitHub Pages.

Citations:

C. Nwizu, M. Hughes, M.L. Ramseier, A. Navia, A.K. Shalek, N. Fusi, S. Raghavan, P.S. Winter, A.P. Amini, and L. Crawford. Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data. bioRxiv. 2024.02.11.579839.

Contact:

Please contact Chibuikem Nwizu or Lorin Crawford with any comments or questions.


RATE: RelATive cEntrality Measures for Variable Prioritization

RATE is a software package that provides a novel approach for assessing input variable importance after having fit a nonlinear or nonparametric (Bayesian) model. By assessing entropy in the joint posterior distribution via Kullback-Leibler divergence (KLD), RATE can correctly prioritize candidate variables which are not just marginally important, but also those whose associations stem from a significant covarying relationship with other variables in the data. RATE is demonstrated in the context of statistical genetics, where the discovery of variants that are involved in nonlinear interactions is of particular interest. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, S.R. Flaxman, D.E. Runcie, and M. West (2019). Variable prioritization in nonlinear black box methods: a genetic association case study. Annals of Applied Statistics. 13(2): 958-989.

J. Ish-Horowicz*, D. Udwin*, K. Scharfstein, S.R. Flaxman, L. Crawford, and S.L. Filippi. Interpreting deep neural networks through variable importance. arXiv. 1901.09839.

Contact:

Please contact Lorin Crawford with any comments or questions.


SECT: The Smooth Euler Characteristic Transform

This software package explores the use of a novel statistic, the smooth Euler characteristic transform (SECT), as an automated procedure to extract geometric or topological statistics from tumor images. More generally, the SECT is designed to integrate shape information into regression models by representing shapes and surfaces as a collection of curves. Due to its well-defined inner product structure, the SECT can be used in a wider range of functional and nonparametric modeling approaches than other previously proposed topological summary statistics. We illustrate the utility of the SECT in a radiomics context by showing that the topological quantification of tumors, assayed by magnetic resonance imaging (MRI), is a better predictor of clinical outcomes in patients with glioblastoma multiforme (GBM). Using publicly available data from The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), we show that SECT features alone explain more of the variance in patient survival than gene expression, volumetric features, and morphometric features. The software is distributed under the GNU General Public License.
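
The idea of representing a shape as a curve can be sketched for a single direction. The example below computes an Euler characteristic curve (vertices minus edges plus faces of each sublevel set of a height function) for a small triangular mesh, then smooths it by centering at its mean and integrating cumulatively. This is a minimal sketch with hypothetical function names and a toy mesh, not the package's implementation, and it uses a simple Riemann sum for the integral.

```python
def euler_characteristic_curve(vertices, edges, faces, direction, thresholds):
    """Euler characteristic (V - E + F) of each sublevel set of the height function."""
    heights = [sum(c * d for c, d in zip(v, direction)) for v in vertices]

    def count_below(simplices, t):
        # A simplex enters the sublevel set once its highest vertex does.
        return sum(1 for s in simplices if max(heights[i] for i in s) <= t)

    return [sum(1 for h in heights if h <= t)
            - count_below(edges, t) + count_below(faces, t)
            for t in thresholds]


def smooth_ec_curve(curve, step):
    """Center the curve at its mean, then integrate cumulatively (Riemann sum)."""
    mean = sum(curve) / len(curve)
    smoothed, running = [], 0.0
    for value in curve:
        running += (value - mean) * step
        smoothed.append(running)
    return smoothed


# Toy shape: the boundary of a tetrahedron, filtered along the z-axis.
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
thresholds = [0.0, 0.5, 1.0]
curve = euler_characteristic_curve(vertices, edges, faces, (0, 0, 1), thresholds)
print(curve)
print(smooth_ec_curve(curve, step=0.5))
```

Repeating this over many directions and concatenating the smoothed curves gives a vector summary of the shape that can be passed to a regression model.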

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, A. Monod, A.X. Chen, S. Mukherjee, and R. Rabadán (2020). Predicting clinical outcomes in glioblastoma: an application of topological and functional data analysis. Journal of the American Statistical Association. 115(531): 1139-1150.

Contact:

Please contact Lorin Crawford or Anthea Monod with any comments or questions.


SINATRA: Pipeline for Sub-Image Analysis and Feature Selection on 3D Shapes

The sub-image selection problem is to identify physical regions that most explain the variation between two classes of three dimensional shapes. SINATRA is a software package that implements a statistical pipeline for carrying out sub-image analyses using topological summary statistics. The algorithm follows four key steps: (1) 3D shapes (represented as triangular meshes) are summarized by a collection of vectors (or curves) detailing their topology (e.g., Euler characteristics, persistence diagrams). (2) A statistical model is used to classify the shapes based on their topological summaries. Here, we make use of a Gaussian process classification model with a probit link function. (3) After fitting the model, an association measure is computed for each topological feature (e.g., centrality measures, posterior inclusion probabilities, p-values, etc). (4) Association measures are mapped back onto the original shapes via a reconstruction algorithm, thus highlighting evidence of the physical (spatial) locations that best explain the variation between the two groups. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

B. Wang*, T. Sudijono*, H. Kirveslahti*, T. Gao, D.M. Boyer, S. Mukherjee, and L. Crawford (2021). A statistical pipeline for identifying physical features that differentiate classes of 3D shapes. Annals of Applied Statistics. 15(2): 638-661.

Contact:

Please contact Bruce Wang or Timothy Sudijono with any comments or questions.


SINATRA Pro: Protein Conformation Analysis using Topological Summary Statistics

The sub-image selection problem is to identify physical regions that most explain the variation between two classes of three dimensional shapes. SINATRA is a statistical pipeline for carrying out sub-image analyses using topological summary statistics (Wang et al. 2021, Ann Appl Stat). SINATRA Pro is an adaptation of the SINATRA framework for structure-based applications in protein dynamics. The general algorithm follows four key steps: (1) 3D shapes of protein structures (represented as triangular meshes) are summarized by a collection of vectors (or curves) detailing their topology (e.g., Euler characteristics, persistence diagrams, etc). (2) A statistical model is used to classify the shapes based on their topological summaries. Here, we make use of a Gaussian process classification model with a probit link function. (3) After fitting the model, an association measure is computed for each topological feature (e.g., centrality measures, posterior inclusion probabilities, p-values, etc). (4) Association measures are mapped back onto the original protein structures via a reconstruction algorithm, thus highlighting atomic or residue-level positions that best explain the variation between two ensembles. The software is distributed under the MIT License.

Download:

The software is currently available on GitHub.

Citations:

W.S. Tang*, G.M. da Silva*, H. Kirveslahti, E. Skeens, B. Feng, T. Sudijono, K.K. Yang, S. Mukherjee, B. Rubenstein, and L. Crawford (2022). A topological data analytic approach for discovering biophysical signatures in protein dynamics. PLOS Computational Biology. 18(5): e1010045.

Contact:

Please contact Wai Shing Tang with any comments or questions.


Tropix: Tropical Sufficient Statistics for Persistent Homology

Tropix is a software package that uses an embedding in Euclidean space based on tropical geometry to generate stable sufficient statistics for barcodes --- multiscale summaries of topological characteristics that capture the “shape” of data, but have complex structures and are therefore difficult to use in statistical settings. This statistical sufficiency result allows for the assumption of classical probability distributions on Euclidean representations of barcodes. This in turn makes a variety of parametric inference methods amenable to barcodes, all while maintaining their initial interpretations. In particular, this work shows that exponential family distributions may be assumed, and that likelihoods for persistent homology may be constructed. In the citation below, we use Tropix to conceptually demonstrate sufficiency and illustrate its utility in persistent homology dimensions 0 and 1 with concrete parametric applications to HIV and avian influenza data. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

A. Monod, S. Kališnik Verovšek, J.Á. Patiño-Galindo, and L. Crawford (2019). Tropical sufficient statistics for persistent homology. SIAM Journal on Applied Algebra and Geometry. 3(2): 337-371.

Contact:

Please contact Lorin Crawford or Anthea Monod with any comments or questions.