The Engelhardt Group develops statistical models and methods to elucidate the biological mechanisms of complex phenotypes and disease. Measurements of biological systems carry both noise and systematic bias, and the analytical goal is often to identify low-dimensional substructure within a high-dimensional space. These qualities are well addressed by model-based analyses, but the high dimension and scale of biological data make parameter estimation in sophisticated models challenging. We address these challenges by developing hierarchical statistical models and approximate parameter estimation methods that give access to interesting biological phenomena.
Statistical Analysis of Genetic Association Studies
Multi-trait GWAS. Genome-wide association studies (GWAS) identify genetic variants that are associated with the occurrence of a complex phenotype or disease in a set of individuals. Many phenotypes are difficult to quantify with a single measure. I am building methods for conducting GWAS using survey data as the phenotype. Standard dimensionality reduction techniques were not effective for reducing these data because the resulting phenotype summaries were not interpretable. In prior work, we applied sparse factor analysis (SFA) and found that the sparse solution had phenotypic interpretations for all of the factors, and genetic associations for a number of phenotypes. Our current work extends this model to improve robustness and to infer the number of factors from the underlying data. Publications: [Hart, Engelhardt et al., 2012]
Differential eQTLs. Although it is straightforward to determine whether a SNP impacts transcription of a gene, it is less clear how to test whether a SNP regulates transcription of a gene differently in the presence of a chemical modifier. With collaborators from the Children's Hospital Oakland Research Institute (CHORI), I am applying a Bayesian test based on regression with multiple correlated responses to determine whether statins change how a SNP modulates transcription. We are using these regression models to test five possible scenarios for differential regulation in baseline and treated cell lines in 10,195 genes and 7.8 million SNPs for 480 individuals. So far we have found several differential eQTLs affecting genes in a cholesterol pathway, along with thousands of eQTLs. We are currently developing methods that consider gene expression networks and differential mRNA interactions, and that incorporate additional NGS data to refine network structure. Publications: [Mangravite, Engelhardt, et al., 2013]
Bayesian tests for association. We are considering models for Bayesian tests of association between genotype and phenotype that do not rely on an additive or dominance assumption for quantitative traits.
Understanding how eQTLs work by looking across eQTL studies, cell types, and regulatory element data
GTEx study analyses: As part of the GTEx consortium, and in collaboration with Casey Brown, we have conducted large-scale replication studies across eleven studies in seven tissue types. We have overlaid these results onto regulatory element data to gain a deeper mechanistic understanding of eQTLs by examining where eQTLs, including cell-type-specific eQTLs, co-locate with specific cis-regulatory elements. We are currently developing statistical models for understanding eQTLs and variants that influence mRNA isoform levels in RNA-seq data. We are also working on predictive models for eQTLs across tissue types and on models that consider replication in trans-eQTLs. Publications: [Brown, Mangravite, Engelhardt, 2013]
Fine mapping of cis-eQTLs: With Ryan Adams, we have developed a Bayesian structured sparse prior in which the relationships between predictors are modelled using a Gaussian process with a user-defined, domain-specific kernel function. In a regression framework, we incorporate genome-specific information into the kernel in order to facilitate fine mapping. Publication: [Engelhardt & Adams 2014]
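As a rough illustration of what a domain-specific kernel over predictors might look like, the sketch below builds a covariance matrix over SNPs from their base-pair positions. The squared-exponential form, the `length_scale` value, and the function names are assumptions made for this example; they are not the kernel used in the published model, which can encode other genome-specific information.

```python
from math import exp

def squared_exponential(pos_a, pos_b, length_scale=50_000.0):
    """Assumed kernel: similarity of two SNPs decays with base-pair distance."""
    return exp(-0.5 * ((pos_a - pos_b) / length_scale) ** 2)

def kernel_matrix(positions, length_scale=50_000.0):
    """Covariance matrix over SNPs, one row and one column per SNP."""
    return [[squared_exponential(a, b, length_scale) for b in positions]
            for a in positions]

# Hypothetical SNP positions (base pairs) within one cis window.
positions = [100_000, 105_000, 400_000]
K = kernel_matrix(positions)
```

Under this assumed kernel, the two SNPs that lie 5 kb apart receive a covariance close to one while the distant SNP is nearly independent of both, so a Gaussian process with this covariance would tend to treat neighbouring SNPs similarly.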
Sparse latent factor models
Identifying population structure. Matthew Stephens and I considered the problem of identifying latent structure in a population of individuals. We considered the two methods most commonly applied to this problem, namely, admixture models and principal components analysis (PCA), in the framework of matrix factorization methods with different matrix constraints. Within this framework, we described a sparse factor analysis model (SFA) that encouraged sparsity on the factor loadings through an automatic relevance determination prior. Results from SFA bridged the gap between admixture models and PCA: SFA did not over-regularize the data as admixture models tend to do, but, unlike PCA, sparsity enabled well-separated populations to each be associated with a single factor, making the results interpretable as with admixture models. However, we found that the three methods produced similar results for continuous populations: in a European sample of 1,387 individuals with approximately 200,000 SNPs, two factors captured the geography of the sample well under all three methods.
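The matrix factorization view described above can be written compactly. The notation below (G, Λ, F, E, and the variances σ²) is assumed for illustration and does not appear on this page; the constraints are given as each method is commonly described, and the normal prior is one common form of an automatic relevance determination prior rather than a statement of the exact published model.

```latex
% n individuals, p SNPs, K factors (assumed notation)
\begin{align*}
G &= \Lambda F + E,
  & G &\in \mathbb{R}^{n \times p},\;
    \Lambda \in \mathbb{R}^{n \times K},\;
    F \in \mathbb{R}^{K \times p} \\
\text{PCA:} \quad
  & \Lambda^{\top}\Lambda \text{ diagonal}, & F F^{\top} &= I_K \\
\text{Admixture:} \quad
  & \lambda_{ik} \ge 0, & \textstyle\sum_{k=1}^{K} \lambda_{ik} &= 1 \\
\text{SFA:} \quad
  & \lambda_{ik} \sim \mathcal{N}(0, \sigma_{ik}^{2}), & \sigma_{ik}^{2} &\ge 0
\end{align*}
```

In this sketch, a variance σ² that is driven to zero removes the corresponding loading, which is how the prior encourages sparsity on the loadings.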
Publications: [Engelhardt & Stephens, 2010]
Gene expression analyses with confounding effects. We developed factor analysis models with effective sparsity-inducing priors that go beyond automatic relevance determination priors and the properties of the traditional spike-and-slab type priors. A three-layer shrinkage prior on the traditional factor analysis model yields both element-wise sparsity in the loadings matrix and non-parametric estimation of the number of latent factors from the data. We applied these approaches to gene expression data to uncover sparse sets of co-expressed genes; we use these fitted models to identify trans-eQTLs and find that we gain power over univariate approaches. Publications: [Gao, Brown, Engelhardt 2013] [Gao, Zhao, McDowell, Brown, Engelhardt 2014]
Bayesian canonical correlation analysis. We developed a method for Bayesian canonical correlation analysis (or, more generally, group factor analysis) that estimates sparse covariance matrices for any subset of multiple observations of the same samples. We applied this model to genotype and gene expression data from eQTL studies to perform multi-SNP, multi-trait association mapping. We also use this approach to identify covariance matrices specific to data covariates, e.g., sexually dimorphic gene co-expression networks. Publications: [Gao, Zhao, McDowell, Brown, Engelhardt 2014]
Epigenome-wide association studies
We have used exploratory data analysis approaches to examine the latent structure in epigenetic markers, including methylation status. We have developed classifiers to predict methylation status at individual CpG sites using genomic markers. We are currently developing methods for the analysis of methylation marks and other epigenome-wide scans for association of epigenetic marks with phenotypes of interest.
Publications: [Zhang, Spector, Deloukas, Bell, Engelhardt 2015]
Protein molecular function prediction
As a graduate student with Dr. Michael Jordan, collaborating with Dr. Steven Brenner, I created a statistical methodology, SIFTER (Statistical Inference of Function Through Evolutionary Relationships), to capture how protein molecular function evolves within a phylogeny in order to accurately predict function for unannotated proteins, improving over existing methods that use pairwise sequence comparisons. We relied on the assumption that function evolves in parallel with sequence evolution, implying that phylogenetic distance is the natural measure of functional divergence. In SIFTER, molecular function evolves as a first-order Markov chain within a phylogenetic tree. Posterior probabilities are computed exactly using message-passing, with an approximate method for large or functionally diverse protein families; model parameters are estimated using generalized expectation maximization. Functional predictions are extracted from protein-specific posterior probabilities for each function. I applied SIFTER to a genome-scale fungal data set, which included families of proteins from 46 fully-sequenced fungal genomes, and SIFTER substantially outperformed state-of-the-art methods in producing correct and specific predictions.
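The idea of a first-order Markov chain on a phylogeny with exact message passing can be sketched on a toy tree. This is a minimal illustration and not SIFTER itself: the single binary function state, the fixed `stay` probability (no branch lengths), the uniform root prior, and all node names are assumptions made for the example.

```python
STATES = (0, 1)  # assumed binary state: 0 = lacks the function, 1 = has it

def transition(parent_state, child_state, stay=0.9):
    """P(child state | parent state): keep the parent's state with prob. `stay`."""
    return stay if parent_state == child_state else 1.0 - stay

def upward(node, children, evidence, stay):
    """Message [P(evidence at or below node | node = s) for s in STATES]."""
    if node in evidence:  # annotated leaf: indicator on the observed state
        return [1.0 if s == evidence[node] else 0.0 for s in STATES]
    msg = [1.0 for _ in STATES]
    for child in children.get(node, []):
        child_msg = upward(child, children, evidence, stay)
        for s in STATES:
            msg[s] *= sum(transition(s, c, stay) * child_msg[c] for c in STATES)
    return msg

def posterior(query, root, children, evidence, stay=0.9, prior=(0.5, 0.5)):
    """Exact posterior over the state of an unannotated leaf `query`."""
    joint = []
    for s in STATES:
        clamped = {**evidence, query: s}  # fix the query state, then pass upward
        msg = upward(root, children, clamped, stay)
        joint.append(sum(prior[r] * msg[r] for r in STATES))
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical family: two annotated proteins and one unannotated protein (a2).
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
evidence = {"a1": 1, "b1": 0}
post_a2 = posterior("a2", "root", children, evidence)
```

Because `a2` shares a recent ancestor with the annotated protein `a1`, its posterior favours having the function even though the more distant `b1` lacks it, which is the sense in which phylogenetic distance drives the prediction. Clamping the query and repeating the upward pass is a simplification; a full upward and downward pass would give all marginals at once.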
Experimental design. I developed a sequential experimental design method to rank the protein experimental characterizations that are expected to most improve the confidence of SIFTER predictions. This component leverages the posterior probabilities from the SIFTER model to rank proteins based on a mutual information-based criterion. Experimental design in this setting enables biologists to perform a minimal number of expensive and labor-intensive experiments to better understand how molecular function evolves in a protein family. The experimental design method performed well; in one family the necessary experiments were reduced from 28 to four.
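A mutual information-based ranking can be illustrated with a small sketch. It assumes that, for each candidate protein, a joint distribution between the outcome of characterizing that protein and some target quantity of interest is already available (in practice this would come from the model's posterior); the joint tables, protein names, and function names below are hypothetical and not the published criterion.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint table joint[x][y] whose entries sum to 1."""
    px = [sum(row) for row in joint]        # marginal of the experiment outcome
    py = [sum(col) for col in zip(*joint)]  # marginal of the target quantity
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * log2(p / (px[x] * py[y]))
    return mi

def rank_candidates(joints):
    """Candidate names ordered from most to least informative experiment."""
    return sorted(joints, key=lambda name: mutual_information(joints[name]),
                  reverse=True)

# Hypothetical joint tables: rows = experiment outcome, columns = target state.
joints = {
    "proteinA": [[0.45, 0.05], [0.05, 0.45]],  # outcome strongly tied to target
    "proteinB": [[0.25, 0.25], [0.25, 0.25]],  # outcome independent of target
}
ranking = rank_candidates(joints)
```

In a sequential design, the top-ranked protein would be characterized first, the posterior updated with the result, and the ranking recomputed before choosing the next experiment.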
Publications: [Engelhardt, Jordan & Brenner, in prep]