Software

DPGP: Dirichlet process Gaussian process clustering for time series data

We developed a nonparametric model-based method, the Dirichlet process Gaussian process mixture model (DPGP), to jointly model data clusters with a Dirichlet process and temporal dependencies with Gaussian processes.
The work is described in:

McDowell IC, Manandhar D, Vockley CM, Schmid AK, Reddy TE, and Engelhardt BE (2018). “Clustering gene expression time series data using an infinite Gaussian process mixture model” (PLOS Computational Biology) [PDF]

The DPGP software, written and maintained by Ian McDowell, is publicly available: [Software]
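
As a rough illustration of the two ingredients named above, the sketch below draws cluster assignments from a Chinese restaurant process (one standard representation of a Dirichlet process prior) and then draws one smooth trajectory per cluster from a Gaussian process. It is not the DPGP implementation and performs no inference. The function names, the squared-exponential kernel, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def sample_crp_assignments(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process."""
    labels = np.zeros(n_items, dtype=int)
    counts = [1]  # the first item opens the first cluster
    for i in range(1, n_items):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[i] = k
    return labels, np.array(counts)

def rbf_kernel(t, lengthscale, variance):
    """Squared-exponential covariance between all pairs of time points."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
labels, counts = sample_crp_assignments(n_items=50, alpha=1.0, rng=rng)

# One GP draw per cluster serves as that cluster's mean trajectory.
t = np.linspace(0.0, 10.0, 20)
K = rbf_kernel(t, lengthscale=2.0, variance=1.0) + 1e-6 * np.eye(t.size)
L = np.linalg.cholesky(K)  # the jitter above keeps the factorization stable
cluster_means = (L @ rng.standard_normal((t.size, counts.size))).T

# Each simulated time series is its cluster trajectory plus independent noise.
series = cluster_means[labels] + 0.1 * rng.standard_normal((labels.size, t.size))
```

The number of clusters is not fixed in advance: it grows with the number of series at a rate governed by the assumed concentration parameter alpha.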


BTH: Bayesian test for heteroskedasticity

We developed a general approach to identifying covariates with variance effects on a quantitative trait using a Bayesian heteroskedastic linear regression model.
The work is described in:

Dumitrascu B, Darnell G, Ayroles J, and Engelhardt BE. “A Bayesian test to identify variance effects” (submitted) [arXiv]

The BTH software, written and maintained by Bianca Dumitrascu, is publicly available: [Software]


HCPF and CCPF: Hierarchical and Coupled compound Poisson factorization

We developed a general framework, the coupled compound Poisson factorization (CCPF), to capture the missing-data mechanism in extremely sparse data sets by coupling a hierarchical Poisson factorization with an arbitrary data-generating model. In the context of matrix factorization, the hierarchical compound Poisson factorization (HCPF) decouples the sparsity model from the response model, allowing us to choose the most suitable distribution for the response. HCPF can capture binary, non-negative discrete, non-negative continuous, and zero-inflated continuous responses.
The work is described in:

Basbug M and Engelhardt BE (2016). “Hierarchical compound Poisson factorization” Proceedings of the International Conference on Machine Learning (ICML) [PDF]

Basbug M and Engelhardt BE. “Coupled compound Poisson factorization” (submitted) [arXiv]

The HCPF and CCPF software, written and maintained by Mehmet Basbug, is publicly available: [Software]
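
To make the decoupling of the sparsity model from the response model concrete, the sketch below simulates a matrix in which a Poisson factorization decides which entries are non-zero and a separate distribution generates the non-zero values. It is a toy generative simulation, not the HCPF or CCPF software. The gamma response distribution, the factor priors, and every numeric value are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols, n_factors = 30, 40, 3

# Sparsity model: non-negative factors give a Poisson rate for every entry.
theta = rng.gamma(shape=0.3, scale=1.0, size=(n_rows, n_factors))
beta = rng.gamma(shape=0.3, scale=1.0, size=(n_cols, n_factors))
rate = theta @ beta.T
counts = rng.poisson(rate)  # latent count; zero means the entry is missing/zero

# Response model: the sum of `count` i.i.d. Gamma(2.0, scale=0.5) draws,
# which is itself Gamma(2.0 * count, scale=0.5). Entries with count 0 stay 0.
response = np.zeros(rate.shape)
observed = counts > 0
response[observed] = rng.gamma(shape=2.0 * counts[observed], scale=0.5)

sparsity = 1.0 - observed.mean()  # fraction of exact zeros in the matrix
```

Swapping the gamma draw for another additive distribution (for example a Poisson or a normal) changes the response type without touching the sparsity model.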


LLGP: Large linear multi-output Gaussian process learning for time series

We developed Large Linear GP (LLGP), which circumvents the need for stationarity in time series data by inducing structure in the LMC kernel through a common grid of inputs shared between outputs, enabling optimization of GP hyperparameters for multi-dimensional outputs and low-dimensional inputs.
The work is described in:

Feinberg V, Cheng L-F, Li K, and Engelhardt BE. “Large linear multi-output Gaussian process learning for time series” (submitted) [arXiv]

The LLGP software, written and maintained by Vladimir Feinberg, is publicly available: [Software]
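
The sketch below illustrates why a common input grid is useful for a linear model of coregionalization (LMC) kernel. When all outputs share the same inputs, the covariance is a sum of Kronecker products, and a matrix-vector product can be computed without forming the full matrix. This shows only the generic Kronecker identity, not the LLGP algorithm. The function names, the squared-exponential kernels, and the parameter values are assumptions.

```python
import numpy as np

def rbf_kernel(x, lengthscale):
    """Squared-exponential Gram matrix on a one-dimensional input grid."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def lmc_matvec(coregs, grams, v):
    """Compute (sum_q B_q kron K_q) @ v without building the Kronecker products.

    v is ordered output-major: all grid points for output 0, then output 1, ...
    """
    n_outputs = coregs[0].shape[0]
    n_inputs = grams[0].shape[0]
    V = v.reshape(n_outputs, n_inputs)
    out = np.zeros_like(V)
    for B, K in zip(coregs, grams):
        out += B @ V @ K.T  # equals (B kron K) @ v, reshaped to a matrix
    return out.ravel()

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 25)  # grid shared by every output
n_outputs = 4
lengthscales = [0.1, 0.5]

grams = [rbf_kernel(x, ls) for ls in lengthscales]
coregs = []
for _ in lengthscales:
    a = rng.standard_normal((n_outputs, 1))
    kappa = rng.uniform(0.1, 1.0, size=n_outputs)
    coregs.append(a @ a.T + np.diag(kappa))  # rank-one plus diagonal, PSD

# Dense reference, built only to check the structured product.
K_dense = sum(np.kron(B, K) for B, K in zip(coregs, grams))
v = rng.standard_normal(n_outputs * x.size)
fast = lmc_matvec(coregs, grams, v)
```

The structured product costs on the order of D n^2 + D^2 n operations per kernel term for D outputs and n grid points, compared with (D n)^2 for the dense product.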


MedGP: Sparse multi-output Gaussian processes for medical time series

We developed a highly structured sparse GP kernel to enable tractable computation over tens of thousands of time points while estimating correlations among clinical covariates, patients, and periodicity in high-dimensional time series measurements of physiological signals. We applied MedGP to the MIMIC III data.
The work is described in:

Cheng L-F, Darnell G, Chivers C, Draugelis ME, Li K, and Engelhardt BE. “Sparse multi-output Gaussian processes for medical time series prediction” (submitted) [arXiv]

The MedGP software, written and maintained by Li-Fang Cheng, is publicly available: [Software]


ARSVD: Adaptive randomized singular value decomposition

We developed an adaptive randomized algorithm for efficient solutions to principal component analysis (PCA), and we use this efficient solver to improve estimation in large-scale linear mixed models (LMMs).
The work is described in:

Darnell G, Georgiev S, Mukherjee S, and Engelhardt BE. “Adaptive randomized dimension reduction on massive data” Journal of Machine Learning Research (JMLR) [PDF]

The ARSVD software, maintained by Greg Darnell, is publicly available: [Software]
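
For readers unfamiliar with randomized dimension reduction, the sketch below shows a basic fixed-rank randomized SVD: project onto a random subspace, orthonormalize, then decompose the small projected matrix. It is a generic textbook variant, not the ARSVD algorithm. In particular, the target rank is supplied by hand here rather than chosen adaptively, and the function name and default parameters are assumptions.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, n_power_iter=2, rng=None):
    """Approximate the top `rank` singular triplets of A by random projection."""
    rng = np.random.default_rng() if rng is None else rng
    k = min(rank + n_oversample, min(A.shape))

    # Sample the range of A with a Gaussian test matrix.
    omega = rng.standard_normal((A.shape[1], k))
    Q, _ = np.linalg.qr(A @ omega)

    # Power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Exact SVD of the small projected matrix, mapped back to the original space.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(3)
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 200))  # exact rank 5
U, s, Vt = randomized_svd(A, rank=5, rng=rng)
rel_error = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

To use such a solver for PCA, center the columns of the data matrix first; the rows of Vt then approximate the leading principal directions.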


BGFA: Bayesian canonical correlation analysis and group factor analysis

Given two or more paired observation matrices, BGFA finds sparse and dense latent components corresponding to observation-specific covariances or covariance terms shared across observations. In the case of m = 2 observations, this model is the canonical correlation model. The linear latent space is the linear projection that maximizes the correlation across the two observations.
The work is described in:

Zhao S, Gao C, Mukherjee S, and Engelhardt BE (2016). “Bayesian group latent factor analysis with structured sparse priors” Journal of Machine Learning Research (JMLR) [PDF]

The BGFA software, written and maintained by Shiwen Zhao, is publicly available: [Software]
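
The two-observation case can be illustrated with classical canonical correlation analysis, sketched below. It whitens each observation matrix and takes the SVD of the whitened cross-covariance. This is the non-Bayesian, non-sparse estimator, not the BGFA model. The function name, the small ridge term, and the simulated data are assumptions.

```python
import numpy as np

def classical_cca(X, Y, ridge=1e-8):
    """Return canonical correlations and projection weights for paired X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    # Whitened cross-covariance M = Lx^{-1} Sxy Ly^{-T}, with Sxx = Lx Lx^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T

    # Singular values of M are the canonical correlations.
    U, corrs, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U)     # weights in the original X coordinates
    Wy = np.linalg.solve(Ly.T, Vt.T)  # weights in the original Y coordinates
    return corrs, Wx, Wy

# Two observation matrices that share one latent signal z.
rng = np.random.default_rng(4)
n = 2000
z = rng.standard_normal((n, 1))
X = z @ rng.standard_normal((1, 5)) + 0.5 * rng.standard_normal((n, 5))
Y = z @ rng.standard_normal((1, 4)) + 0.5 * rng.standard_normal((n, 4))

corrs, Wx, Wy = classical_cca(X, Y)
proj_x = (X - X.mean(axis=0)) @ Wx[:, 0]
proj_y = (Y - Y.mean(axis=0)) @ Wy[:, 0]
empirical = np.corrcoef(proj_x, proj_y)[0, 1]  # matches corrs[0] up to the ridge
```

With one shared latent signal, only the first canonical correlation should be large; the remaining ones reflect sampling noise.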


XFA: Expandable factor analysis

Given an observation matrix, XFA finds sparse latent components corresponding to observation covariance.
The work is described in:

Srivastava S, Engelhardt BE, and Dunson DB (2017). “Expandable factor analysis” Biometrika [PDF]

The XFA software, written by Sanvesh Srivastava, is publicly available: [Software]


BicMix: Bayesian biclustering via a doubly-sparse latent factor model

This software finds two sparse low-dimensional matrices that capture sparse covariance structure in the response matrix.
The work is described in:

Gao C, Zhao S, McDowell IC, Brown CD, and Engelhardt BE. “Differential gene co-expression networks via Bayesian biclustering models” (submitted) [arXiv]

The BicMix software, written and maintained by Dr. Chuan Gao, is publicly available: [Software]; send questions and comments to: chuan.gao@duke.edu


Bayesian structured sparse regression

This software computes the posterior probability of inclusion for each covariate given a set of predictors (and a positive definite matrix describing their similarity) and a quantitative response. The work is described in:

Engelhardt BE and Adams RP. “Bayesian structured sparsity from Gaussian fields” (in review) [arXiv]

The software is available on GitHub [Software]


Posterior predictive checks (PPCs) for admixture models

This software fits the original admixture model to genomic data and encodes the process of performing a posterior predictive check with five possible discrepancy functions. The work is described in:

Mimno D, Blei DM, and Engelhardt BE. “Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure” (in review) [arXiv]

The software is available on GitHub [Software]


Sparse and dense factor analysis (SFAmix)

This software computes a low-rank matrix factorization with a combination of both sparse and dense factor loadings for a given matrix, as described in

Gao C, Brown CD, and Engelhardt BE. “A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects” (submitted) [arXiv]

Download C++ code, instructions, and documentation for SFAmix 1.0.


Data: publicly available eQTL study data with a uniform processing pipeline

These data sets have been processed through a single pipeline for gene expression and genotype data as described in

Brown CD, Mangravite LM, Engelhardt BE (2013). “Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs” PLoS Genetics 9(8): e1003649. [PDF]

We made two changes to the pipeline described above. First, we include genotypes imputed with IMPUTE2 using prephasing, with the 1000 Genomes reference data from March 2012 as the imputation panel. Second, we do not filter out SNPs with low minor allele frequency (MAF). Note that the resulting imputed genotype files are in CHIAMO format.

[HapMap 3]


Sparse factor analysis (SFA)

This software uses ECME to compute a sparse, low-rank matrix factorization for a given matrix, as described in

Engelhardt BE, Stephens M (2010) “Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis.” PLoS Genetics 6(9):e1001117.

Download C++ code and instructions for SFA 1.0 and further documentation for the SFA model.


SIFTER: Statistical Inference of Function Through Evolutionary Relationships

SIFTER software and instructions reside at the Brenner Lab at UC Berkeley, although I am still actively maintaining the code. This software uses a statistical model to predict protein molecular function for unannotated proteins using functional annotations from a set of homologous proteins, described in:

Engelhardt BE, Jordan MI, Srouji JR, and Brenner SE (2011). “Genome-scale phylogenetic function annotation of large and diverse protein families” Genome Research (in press).