Statistical Methods for Genome-wide Association Studies and
Personalized Medicine
by
Jie Liu
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Computer Sciences)
at the
UNIVERSITY OF WISCONSIN-MADISON
2014
Date of final oral examination: 05/16/14 (9am)
Room for final oral examination: CS 4310
Committee in charge:
C. David Page Jr., Professor, Biostatistics and Medical Informatics
Xiaojin Zhu, Associate Professor, Computer Sciences
Jude Shavlik, Professor, Computer Sciences
Elizabeth Burnside, Associate Professor, Radiology
Chunming Zhang, Professor, Statistics
Abstract
In genome-wide association studies (GWAS), researchers analyze the genetic variation across
the entire human genome, searching for variations that are associated with observable traits or
certain diseases. There are several inference challenges in GWAS, including the huge number
of genetic markers to test, the weak association between truly associated markers and the traits,
and the correlation structure between the genetic markers. This thesis mainly develops statistical
methods that are suitable for genome-wide association studies and their clinical translation for
personalized medicine.
After we introduce more background and related work in Chapters 1 and 2, we further discuss
the problem of high-dimensional statistical inference, especially capturing the dependence among
multiple hypotheses, which has been under-utilized in classical multiple testing procedures. Chap-
ter 3 proposes a feature selection approach based on a unique graphical model which can leverage
correlation structure among the markers. This graphical model-based feature selection approach
significantly outperforms the conventional feature selection methods used in GWAS. Chapter 4
reformulates this feature selection approach as a multiple testing procedure that has many elegant
properties, including controlling false discovery rate at a specified level and significantly improv-
ing the power of the tests by leveraging dependence. In order to relax the parametric assumption
within the graphical model, Chapter 5 further proposes a semiparametric graphical model for mul-
tiple testing under dependence, which estimates f1 adaptively. This semiparametric approach
still effectively captures the dependence among multiple hypotheses, and it no longer requires us
to specify the parametric form of f1. It exactly generalizes the local FDR procedure [38] and
connects with the BH procedure [12].
These statistical inference methods are based on graphical models, and their parameter learn-
ing is difficult due to the intractable normalization constant. Capturing the hidden patterns and
heterogeneity within the parameters is even harder. Chapters 6 and 7 discuss the problem of learn-
ing large-scale graphical models, especially dealing with issues of heterogeneous parameters and
latently-grouped parameters. Chapter 6 proposes a nonparametric approach which can adaptively
integrate, during parameter learning, background knowledge about how the different parts of the
graph can vary. For learning latently-grouped parameters in undirected graphical models, Chapter
7 imposes Dirichlet process priors over the parameters and estimates the parameters in a Bayesian
framework. The estimated model generalizes significantly better than standard maximum likeli-
hood estimation.
Chapter 8 explores the potential translation of GWAS discoveries to clinical breast cancer
diagnosis. With support from the Wisconsin Genomics Initiative, we genotyped a breast cancer
cohort at Marshfield Clinic and collected corresponding diagnostic mammograms. We discovered
that, using SNPs known to be associated with breast cancer, we can better stratify patients and
thereby significantly reduce false positives during breast cancer diagnosis, alleviating the risk of
overdiagnosis. This result suggests that when radiologists make medical decisions from
mammograms (such as recommending follow-up biopsies), they can take these risk-associated SNPs
into account for more accurate decisions if the patients' genotype data are available.
Contents
Abstract i
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Work 7
2.1 Hypothesis Testing for Case-control Association Studies . . . . . . . . . . . . . 7
2.1.1 Single-marker Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Parametric Multiple-marker Methods . . . . . . . . . . . . . . . . . . . 15
2.1.3 Nonparametric Multiple-marker Methods . . . . . . . . . . . . . . . . 16
2.2 Multiple Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Error Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 P -value Thresholding Methods . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Local False Discovery Rate Methods . . . . . . . . . . . . . . . . . . . 20
2.2.4 Local Significance Index Methods . . . . . . . . . . . . . . . . . . . . . 21
2.3 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Maximum Likelihood Parameter Learning . . . . . . . . . . . . . . . . . 23
2.3.2 Bayesian Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Inference Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Feature and Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 High-Dimensional Structured Feature Screening Using Markov Random Fields 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Feature Relevance Network . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 The Construction Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 The Inference Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.4 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Experiments on CGEMS Data . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.3 Validating Findings on Marshfield Data . . . . . . . . . . . . . . . . . . 51
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Multiple Testing under Dependence via Parametric Graphical Models 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Terminology and Previous Work . . . . . . . . . . . . . . . . . . . . . . 57
4.2.2 The Multiple Testing Procedure . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 Posterior Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.4 Parameters and Parameter Learning . . . . . . . . . . . . . . . . . . . . 60
4.3 Basic Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Simulations on Genetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Multiple Testing under Dependence via Semiparametric Graphical Models 76
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.1 Graphical models for Multiple Testing . . . . . . . . . . . . . . . . . . . 80
5.3.2 Nonparametric Estimation of f1 . . . . . . . . . . . . . . . . . . . . . . 81
5.3.3 Parametric Estimation of φ and π . . . . . . . . . . . . . . . . . . . . . 82
5.3.4 Inference of θ and FDR Control . . . . . . . . . . . . . . . . . . . . . . 83
5.4 Connections with Classical Multiple Testing Procedures . . . . . . . . . . . . . 84
5.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Learning Heterogeneous Hidden Markov Random Fields 94
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 HMRFs And Homogeneity Assumption . . . . . . . . . . . . . . . . . . 96
6.2.2 Heterogeneous HMRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Parameter Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.1 Contrastive Divergence for MRFs . . . . . . . . . . . . . . . . . . . . . 98
6.3.2 Expectation-Maximization for Learning Conventional HMRFs . . . . . . 99
6.3.3 Learning Heterogeneous HMRFs . . . . . . . . . . . . . . . . . . . . . 102
6.3.4 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7 Bayesian Estimation of Latently-grouped Parameters in Graphical Models 115
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2 Maximum Likelihood Estimation and Bayesian Estimation for MRFs . . . . . . 117
7.3 Bayesian Parameter Estimation for MRFs with Dirichlet Process Prior . . . . . . 118
7.3.1 Metropolis-Hastings (MH) with Auxiliary Variables . . . . . . . . . . . 119
7.3.2 Gibbs Sampling with Stripped Beta Approximation . . . . . . . . . . . . 123
7.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4.1 Simulations on Tree-structure MRFs . . . . . . . . . . . . . . . . . . . . 128
7.4.2 Simulations on Small Grid-MRFs . . . . . . . . . . . . . . . . . . . . . 128
7.4.3 Simulations on Large Grid-MRFs . . . . . . . . . . . . . . . . . . . . . 132
7.5 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8 Genetic Variants Improve Personalized Breast Cancer Diagnosis 138
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3.1 Performance of Combined Models . . . . . . . . . . . . . . . . . . . . . 145
8.3.2 Performance of Genetic Models . . . . . . . . . . . . . . . . . . . . . . 147
8.3.3 Comparing Breast Imaging Model and Genetic Model . . . . . . . . . . 147
8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9 Future Work 151
Chapter 1
Introduction
1.1 Background
The Human Genome Project, completed in 2003, made it possible for the first time to read
the complete genetic blueprint of human beings. Since then, researchers have been investigating
which germline genetic variants are associated with heritable diseases and traits in humans, an
effort known as genome-wide association studies (GWAS). GWAS analyze genetic variation across
the entire human genome, searching for variations that are associated with observable traits or
certain diseases. In machine learning terminology, an example in GWAS is typically a person, the
response variable is a disease such as breast cancer, and the features (or variables) are the single
positions in the genome where individuals can vary, known as single-nucleotide polymorphisms
(SNPs). The primary goal in GWAS is to identify all the SNPs that are relevant to the diseases or
observable traits of interest.
GWAS are characterized by high dimensionality. The human genome has roughly 3 billion
positions, roughly 3 million of which are SNPs. State-of-the-art technology enables measurement
of a million SNPs in one experiment for a cost of hundreds of US dollars. Although this means
the full set of known SNPs cannot be measured in one experiment at present, SNPs that are close
together on the genome are often highly correlated, so the omission of some SNPs is not as
much of a problem as one might first think. Instead, we face the problem of strong correlation
among our features: most SNPs are very highly correlated with one or more nearby SNPs, with
squared Pearson correlation coefficients well above 0.8.
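To make this concrete, the squared Pearson correlation (r²) between two SNPs can be computed directly from genotypes coded as minor-allele counts (0/1/2). The following is a minimal sketch, not code from the thesis; the simulated "tightly linked neighbor" is an illustrative assumption standing in for real linkage disequilibrium.

```python
import numpy as np

def genotype_r2(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2
    (number of copies of the minor allele carried by each person)."""
    r = np.corrcoef(snp_a, snp_b)[0, 1]
    return r ** 2

rng = np.random.default_rng(0)
# Simulate one SNP and a tightly linked neighbor: the neighbor copies the
# first SNP's genotype except for occasional independent differences.
snp1 = rng.binomial(2, 0.3, size=1000)
noise = rng.binomial(2, 0.3, size=1000)
flip = rng.random(1000) < 0.05          # 5% of genotypes differ
snp2 = np.where(flip, noise, snp1)

print(genotype_r2(snp1, snp2))          # typically well above 0.8
```

With only 5% of genotypes differing, the pair lands comfortably above the r² = 0.8 threshold mentioned in the text.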
Another problem making GWAS especially challenging is weak association: the truly
relevant markers are very rare and only weakly associated with the response variable. The first
reason is that most diseases have both a genetic and an environmental component. Because of the
environmental component, we cannot expect to achieve anywhere near 100% accuracy in GWAS.
For example, it is estimated that genetics accounts for only about 27% of breast cancer risk [102].
Therefore, given equal numbers of breast cancer patients and controls without breast cancer, the
highest predictive accuracy we can reasonably expect from genetic features alone is about 63.5%,
obtainable by correctly predicting the controls and correctly recognizing 27% of the cancer cases
based on genetics. Furthermore, breast cancer and many other diseases are polygenic, and there-
fore the genetic component is spread over multiple genes. Based on these two observations, we
expect the contribution from any one feature (SNP) toward predicting disease to be quite small.
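The 63.5% ceiling follows from simple arithmetic, sketched below under the stated assumptions (a balanced cohort, all controls classified correctly, and only the genetically driven fraction of cases recognizable). The function name is ours, for illustration only.

```python
def max_genetic_accuracy(genetic_fraction, case_fraction=0.5):
    """Upper bound on predictive accuracy from genetics alone: all
    controls are predicted correctly, and only the genetically driven
    fraction of cases can be recognized."""
    control_fraction = 1.0 - case_fraction
    return control_fraction * 1.0 + case_fraction * genetic_fraction

# 50% controls (all correct) + 50% cases x 27% recognizable = 63.5%
print(max_genetic_accuracy(0.27))  # 0.635
```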
Indeed, one published study [82] identified only 4 SNPs associated with breast cancer. When
the most strongly associated SNP (rs1219648) is tested for its predictive accuracy on this same
training set from which it was identified (almost certainly yielding an overly-optimistic accuracy
estimate), the model based on this SNP is only 53% accurate, where majority-class or uniform
random guessing is 50% accurate. Adding credibility, another published study [33] on breast
cancer identified 11 SNPs from a different dataset. It reports that the individual odds ratios
for the 11 SNPs are estimated to be around 0.95 to 1.26, and most of them were not found to be
significant in the former study [82]. Therefore, for breast cancer and other diseases, we expect the
signal from each relevant feature to be very weak.
The combination of high dimension and weak association makes it extremely difficult to detect
the truly associated genetic markers. (Rare alleles of a few SNPs, such as those in the BRCA1 and
BRCA2 genes, have large effects but are very rare; alleles that are common have only weak effects.)
Suppose a truly relevant genetic marker is weakly associated with the class variable. If its odds
ratio is around 1.2, given one thousand cancer cases and
one thousand controls, this marker will not look significantly different between cases and controls,
that is, among examples of different classes. At the same time, if we have an extremely large num-
ber of features, and relatively little data, many irrelevant markers may look better than this relevant
marker by chance alone, especially given even a modest level of noise as occurs in GWAS. Related
work [187] provides a formula to assess the false positive report probability (FPRP), the proba-
bility of no true association between a genetic variant and disease given a statistically significant
finding. If we assume there are around 1,000 truly associated SNPs out of the 500,000 total and
set the significance level at 0.05, the FPRP will be around 99%. This means almost all the
selected features in this case are false positives.
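This calculation can be sketched as follows. A common form of the FPRP formula combines the significance level α, the statistical power 1−β, and the prior probability π that a tested association is real: FPRP = α(1−π) / (α(1−π) + (1−β)π). The power value below is our assumption for illustration (the text does not state one); with low power, the figure approaches the ~99% quoted above.

```python
def fprp(alpha, power, prior):
    """False positive report probability: probability that a statistically
    significant finding is not a true association."""
    false_pos = alpha * (1 - prior)    # rate of significant null markers
    true_pos = power * prior           # rate of detected true associations
    return false_pos / (false_pos + true_pos)

prior = 1000 / 500_000                 # ~1,000 true SNPs out of 500,000
print(fprp(alpha=0.05, power=0.2, prior=prior))  # ~0.99 when power is low
```

Raising the power lowers the FPRP, but with such a small prior it remains high: even at power 1.0, over 96% of significant findings would be false.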
Hypothesis testing is one important statistical inference method for genetic association analy-
sis, since one can simply test the significance of association between one genetic marker and the
response variable. However, in GWAS there are usually hundreds of thousands of genetic markers
to test at the same time. Suppose that we have genotyped a total of m SNPs and performed m
tests simultaneously, one test per genetic marker. In such a multiple testing situation, we can
categorize the results from the m tests as in Table 1.1. One important criterion, the false discovery
rate (FDR), defined as E(N10/R | R > 0)P(R > 0), is the expected proportion of incorrectly
rejected null hypotheses (type I errors). Another criterion, the false non-discovery rate (FNR),
defined as E(N01/S | S > 0)P(S > 0), is the expected proportion of incorrectly non-rejected
non-null hypotheses (type II errors).
              H0 not rejected   H0 rejected   Total
  H0 true          N00              N10        m0
  H0 false         N01              N11        m1
  Total             S                R          m
Table 1.1: Classification of the m tested hypotheses
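The quantities in Table 1.1 can be illustrated with a minimal sketch (ours, not the thesis's) that computes the realized false discovery proportion N10/R and false non-discovery proportion N01/S from hypothetical counts; FDR and FNR are the expectations of these proportions over repeated experiments.

```python
def fdp(n10, n11):
    """False discovery proportion: fraction of rejections (R = N10 + N11)
    that are true nulls, i.e. N10 / R; defined as 0 when R = 0."""
    r = n10 + n11
    return n10 / r if r > 0 else 0.0

def fnp(n00, n01):
    """False non-discovery proportion: fraction of non-rejections
    (S = N00 + N01) that are non-null, i.e. N01 / S; 0 when S = 0."""
    s = n00 + n01
    return n01 / s if s > 0 else 0.0

# Hypothetical counts for m = 1000 tests: 50 rejections, 5 of them false.
print(fdp(n10=5, n11=45))    # 0.1
print(fnp(n00=900, n01=50))  # ~0.0526
```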
A multiple testing procedure is termed valid if it controls FDR at the prespecified level α,