Large Covariance Estimation for Compositional Data
via Composition-Adjusted Thresholding
Yuanpei Cao, Wei Lin, and Hongzhe Li
Abstract
High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data, and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We apply the proposed method to the analysis of a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
Key words: Adaptive thresholding; Basis covariance; Centered log-ratio covariance; High dimensionality; Microbiome; Regularization.
Yuanpei Cao is Ph.D. Candidate (E-mail: [email protected]) and Hongzhe Li is Professor (E-mail: [email protected]), Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104. Wei Lin is Assistant Professor, School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing 100871, China (E-mail: [email protected]). Cao and Li's research was supported in part by NIH grants CA127334 and GM097505. Lin's research was supported in part by NSFC grant 71532001, the Recruitment Program of Global Experts, and a start-up grant from Peking University.
1 Introduction
Compositional data, which represent the proportions or fractions of a whole, arise naturally in a wide range of applications; examples include geochemical compositions of rocks, household patterns of expenditures, species compositions of biological communities, and topic compositions of documents, among many others. This article is particularly motivated by the metagenomic analysis of microbiome data. The human microbiome is the totality of all microbes at various body sites, whose importance in human health and disease has increasingly been recognized. Recent studies have revealed that microbiome composition varies based on diet, health, and the environment (The Human Microbiome Project Consortium 2012), and may play a key role in complex diseases such as obesity, atherosclerosis, and Crohn's disease (Turnbaugh et al. 2009; Koeth et al. 2013; Lewis et al. 2015).
With the development of next-generation sequencing technologies, it is now possible to survey the microbiome composition using direct DNA sequencing of either marker genes or the whole metagenomes. After aligning these sequence reads to the reference microbial genomes, one can quantify the relative abundances of microbial taxa. These sequencing-based microbiome studies, however, provide only a relative, rather than absolute, measure of the abundances of community components. The counts comprising these data (e.g., 16S rRNA gene reads or shotgun metagenomic reads) are set by the amount of genetic material extracted from the community or the sequencing depth, and analysis typically begins by normalizing the observed data by the total number of counts. The resulting fractions thus fall into the class of high-dimensional compositional data that we focus on in this article. The high dimensionality refers to the fact that the number of taxa may be comparable to or much larger than the sample size.
An important question in metagenomic studies is to understand the co-occurrence and co-exclusion relationship between microbial taxa, which would provide valuable insights into the complex ecology of microbial communities (Faust et al. 2012). Standard correlation analysis from the raw proportions, however, can lead to spurious results due to the unit-sum constraint; the proportions tend to be correlated even if the absolute abundances are independent. Such undesired effects should be removed in an analysis in order to make valid inferences about the underlying biological processes. The compositional effects are further magnified by the low diversity of microbiome data, that is, a few taxa make up the overwhelming majority of the microbiome (Friedman and Alm 2012).
Let $X = (X_1, \ldots, X_p)^T$ be a composition of $p$ components (taxa) satisfying the simplex constraint
$$X_j > 0, \quad j = 1, \ldots, p, \qquad \sum_{j=1}^p X_j = 1.$$
Owing to the difficulties arising from the simplex constraint, it has been a long-standing question how to appropriately model, estimate, and interpret the covariance structure of compositional data. The pioneering work of Aitchison (1982, 2003) introduced several equivalent matrix specifications of compositional covariance structures via the log-ratios of components. Statistical methods based on these covariance models respect the unique features of compositional data and prove useful in a variety of applications such as geochemical analysis. A potential disadvantage of these models, however, is that they lack a direct interpretation in the usual sense of covariances and correlations; as a result, it is unclear how to impose certain structures such as sparsity in high dimensions, which is crucial for our applications to microbiome data analysis.
Covariance matrix estimation is of fundamental importance in high-dimensional data analysis and has attracted much recent interest. It is well known that the sample covariance matrix performs poorly in high dimensions and regularization is thus indispensable. Bickel and Levina (2008) and El Karoui (2008) introduced regularized estimators by hard thresholding for large covariance matrices that satisfy certain notions of sparsity. Rothman, Levina, and Zhu (2009) considered a more general class of thresholding functions, and Cai and Liu (2011) proposed adaptive thresholding that adapts to the variability of individual entries. Exploiting a factor model structure, Fan, Fan, and Lv (2008) proposed a factor-based method for high-dimensional covariance matrix estimation. Fan, Liao, and Mincheva (2013) extended the work by considering a conditional sparsity structure and developed a POET method by thresholding principal orthogonal complements.
In this article, we address the problem of covariance estimation for high-dimensional compositional data. Let $W = (W_1, \ldots, W_p)^T$ with $W_j > 0$ for all $j$ be a vector of latent variables, called the basis, that generate the observed data via the normalization
$$X_j = \frac{W_j}{\sum_{i=1}^p W_i}, \quad j = 1, \ldots, p. \tag{1}$$
Estimating the covariance structure of $W$ has traditionally been considered infeasible owing to the apparent lack of identifiability. By exploring a decomposition relating the compositional covariance to the basis covariance, we find, however, that the nonidentifiability vanishes asymptotically as the dimensionality grows under certain sparsity assumptions. More specifically, define the basis covariance matrix $\Omega_0 = (\omega_{ij}^0)_{p \times p}$ by
$$\omega_{ij}^0 = \mathrm{Cov}(Y_i, Y_j), \tag{2}$$
where $Y_j = \log W_j$. Then $\Omega_0$ is approximately identifiable as long as it belongs to a class of large sparse covariance matrices.
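To fix ideas, the following sketch simulates data from the latent-variable model (1) and (2). The log-normal basis and the banded form of $\Omega_0$ are illustrative assumptions, not part of the model; only the normalized compositions $X$ would be observed in practice.

```python
# Illustrative simulation from model (1)-(2): a hypothetical log-normal
# basis W with a sparse banded Omega_0 (both choices are assumptions made
# purely for this example), of which only the composition X is observed.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50

# Basis covariance Omega_0 = Cov(Y), Y = log W: a sparse banded example.
omega0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

Y = rng.multivariate_normal(np.zeros(p), omega0, size=n)  # Y_kj = log W_kj
W = np.exp(Y)                                             # latent basis
X = W / W.sum(axis=1, keepdims=True)                      # composition, eq. (1)

assert np.allclose(X.sum(axis=1), 1.0)  # unit-sum constraint
```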
The somewhat surprising "blessing of dimensionality" allows us to develop a simple, two-step method by first extracting a rank-2 component from the decomposition and then estimating the sparse component $\Omega_0$ by thresholding the residual matrix. The resulting procedure can equivalently be viewed as thresholding the sample centered log-ratio covariance matrix, and hence is optimization-free and scalable for large covariance matrices. We call our method composition-adjusted thresholding (COAT), which removes the "coat" of compositional effects from the covariance structure. We derive rates of convergence under the spectral norm and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We illustrate our method by analyzing a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
The covariance relationship, which was due to Aitchison (2003, sec. 4.11), has recently been exploited to develop algorithms for inferring correlation networks from metagenomic data (Friedman and Alm 2012; Fang et al. 2015; Ban, An, and Jiang 2015). Our contributions here are to turn the idea into a principled approach to sparse covariance matrix estimation and provide statistical insights into the issue of identifiability and the impacts of dimensionality. Our method also bears some resemblance to the POET method proposed by Fan, Liao, and Mincheva (2013) in that underlying both methods is a low-rank plus sparse matrix decomposition. The rank-2 component in our method, however, arises from the covariance structure of compositional data rather than a factor model assumption. As a result, it can be obtained by simple algebraic operations without computing the principal components.
The rest of the article is organized as follows. Section 2 reviews a covariance relationship and addresses the issue of identifiability. Section 3 introduces the COAT methodology. Section 4 investigates the theoretical properties of the COAT estimator in terms of convergence rates and support recovery. Simulation studies and an application to human gut microbiome data are presented in Sections 5 and 6, respectively. We conclude the article with some discussion in Section 7 and relegate all proofs to the Appendix.
2 Identifiability of the Covariance Model
We first introduce some notation. Denote by $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_F$, and $\|\cdot\|_{\max}$ the matrix $L_1$-norm, spectral norm, Frobenius norm, and entrywise $L_\infty$-norm, defined for a matrix $A = (a_{ij})$ by
$$\|A\|_1 = \max_j \sum_i |a_{ij}|, \quad \|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}, \quad \|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}, \quad \text{and} \quad \|A\|_{\max} = \max_{i,j} |a_{ij}|,$$
where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue.
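For reference, these norms map directly onto standard numerical routines; a minimal numpy sketch (illustrative only):

```python
# The four matrix norms defined above, computed with numpy.
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])

norm_l1 = np.abs(A).sum(axis=0).max()      # matrix L1-norm: maximum column sum
norm_spectral = np.linalg.norm(A, ord=2)   # spectral norm: sqrt of lambda_max(A^T A)
norm_frobenius = np.linalg.norm(A, 'fro')  # Frobenius norm
norm_max = np.abs(A).max()                 # entrywise L_infinity-norm
```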
In the latent variable covariance model (1) and (2), the basis covariance matrix $\Omega_0$ is the parameter of interest. One of the matrix specifications of compositional covariance structures introduced by Aitchison (2003) is the variation matrix $T_0 = (\tau_{ij}^0)_{p \times p}$ defined by
$$\tau_{ij}^0 = \mathrm{Var}(\log(X_i/X_j)). \tag{3}$$
In view of the relationship (1), we can decompose $\tau_{ij}^0$ as
$$\tau_{ij}^0 = \mathrm{Var}(\log W_i - \log W_j) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_j) - 2\,\mathrm{Cov}(Y_i, Y_j) = \omega_{ii}^0 + \omega_{jj}^0 - 2\omega_{ij}^0, \tag{4}$$
or in matrix form,
$$T_0 = \omega_0 1^T + 1\omega_0^T - 2\Omega_0, \tag{5}$$
where $\omega_0 = (\omega_{11}^0, \ldots, \omega_{pp}^0)^T$ and $1 = (1, \ldots, 1)^T$. Corresponding to the many-to-one relationship between bases and compositions, the basis covariance matrix $\Omega_0$ is unidentifiable from the decomposition (5), since $\omega_0 1^T + 1\omega_0^T$ and $\Omega_0$ are in general not orthogonal to each other (with respect to the usual Euclidean inner product). In fact, using the centered log-ratio covariance matrix $\Gamma_0 = (\gamma_{ij}^0)_{p \times p}$ defined by
$$\gamma_{ij}^0 = \mathrm{Cov}\{\log(X_i/g(X)), \log(X_j/g(X))\},$$
where $g(x) = (\prod_{j=1}^p x_j)^{1/p}$ is the geometric mean of a vector $x = (x_1, \ldots, x_p)^T$, we can similarly write
$$\tau_{ij}^0 = \mathrm{Var}\{\log(X_i/g(X)) - \log(X_j/g(X))\} = \mathrm{Var}\{\log(X_i/g(X))\} + \mathrm{Var}\{\log(X_j/g(X))\} - 2\,\mathrm{Cov}\{\log(X_i/g(X)), \log(X_j/g(X))\} = \gamma_{ii}^0 + \gamma_{jj}^0 - 2\gamma_{ij}^0,$$
or in matrix form,
$$T_0 = \gamma_0 1^T + 1\gamma_0^T - 2\Gamma_0, \tag{6}$$
where $\gamma_0 = (\gamma_{11}^0, \ldots, \gamma_{pp}^0)^T$. Unlike (5), the following proposition shows that (6) is an orthogonal decomposition and hence the components $\gamma_0 1^T + 1\gamma_0^T$ and $\Gamma_0$ are identifiable. In addition, by comparing the decompositions (5) and (6), we can bound the difference between $\Omega_0$ and its identifiable counterpart $\Gamma_0$ as follows.
Proposition 1. The components $\gamma_0 1^T + 1\gamma_0^T$ and $\Gamma_0$ in the decomposition (6) are orthogonal to each other. Moreover, for the covariance parameters $\Omega_0$ and $\Gamma_0$ in the decompositions (5) and (6),
$$\|\Omega_0 - \Gamma_0\|_{\max} \le 3p^{-1}\|\Omega_0\|_1.$$
Proposition 1 entails that the covariance parameter $\Omega_0$ is approximately identifiable as long as $\|\Omega_0\|_1 = o(p)$. In particular, suppose that $\Omega_0$ belongs to a class of sparse covariance matrices considered by Bickel and Levina (2008),
$$\mathcal{U}(q, s_0(p), M) \equiv \left\{\Omega : \Omega \succ 0,\ \max_j \omega_{jj} \le M,\ \max_i \sum_{j=1}^p |\omega_{ij}|^q \le s_0(p)\right\}, \tag{7}$$
where $0 \le q < 1$ and $\Omega \succ 0$ denotes that $\Omega$ is positive definite. Then
$$\|\Omega_0\|_1 = \max_i \sum_{j=1}^p |\omega_{ij}^0|^{1-q} |\omega_{ij}^0|^q \le \max_i \sum_{j=1}^p (\omega_{ii}^0 \omega_{jj}^0)^{(1-q)/2} |\omega_{ij}^0|^q \le M^{1-q} s_0(p),$$
and hence the parameters $\Omega_0$ and $\Gamma_0$ are asymptotically indistinguishable when $s_0(p) = o(p)$. This allows us to use $\Gamma_0$ as a proxy for $\Omega_0$ and greatly facilitates the development of new methodology and associated theory. The intuition behind the approximate identifiability under the sparsity assumption is that the rank-2 component $\omega_0 1^T + 1\omega_0^T$ represents a global effect that spreads across all rows and columns, while the sparse component $\Omega_0$ represents a local effect that is confined to individual entries.
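Since the centered log-ratio vector satisfies $\log(X_j/g(X)) = Y_j - \bar{Y}$ under (1), we have $\Gamma_0 = G\Omega_0 G$ with $G = I - p^{-1}11^T$, so the bound in Proposition 1 can be checked numerically. A minimal sketch (the banded $\Omega_0$ is again an illustrative assumption):

```python
# Numerical check of the Proposition 1 bound, using the identity
# Gamma_0 = G Omega_0 G with G = I - (1/p) 11^T, which follows from
# clr(X) = (I - (1/p) 11^T) log W under model (1).
import numpy as np

p = 200
omega0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))  # sparse example
G = np.eye(p) - np.ones((p, p)) / p
gamma0 = G @ omega0 @ G

lhs = np.abs(omega0 - gamma0).max()               # ||Omega_0 - Gamma_0||_max
rhs = 3.0 * np.abs(omega0).sum(axis=0).max() / p  # 3 p^{-1} ||Omega_0||_1
assert lhs <= rhs  # the bound holds; here lhs is of order 1/p
```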
Also of interest is the exact identifiability of $\Omega_0$ over $L_0$-balls, which has been studied by Fang et al. (2015) and Ban, An, and Jiang (2015). The following result provides a sufficient and necessary condition for the exact identifiability of $\Omega_0$ by confining it to an $L_0$-ball.

Proposition 2. Suppose that $\Omega_0$ belongs to the $L_0$-ball
$$\mathcal{B}(s_e(p)) \equiv \left\{\Omega : \sum_{(i,j): i<j} I(\omega_{ij} \ne 0) \le s_e(p)\right\},$$
where $p \ge 5$. Then there exist no two values of $\Omega_0$ that correspond to the same $T_0$ in (5) if and only if $s_e(p) < (p-1)/2$.

A counterexample is provided in the proof of Proposition 2 to show that the sparsity conditions in Fang et al. (2015) and Ban, An, and Jiang (2015), which are both of the order $O(p^2)$, do not suffice. The identifiability condition in Proposition 2 essentially requires the average degree of the correlation network to be less than 1, which is too restrictive to be useful in practice. This illustrates the importance and necessity of introducing the notion of approximate identifiability.
3 A Sparse Covariance Estimator for Compositional Data
Suppose that $(W_k, X_k)$, $k = 1, \ldots, n$, are independent copies of $(W, X)$, where the compositions $X_k = (X_{k1}, \ldots, X_{kp})^T$ are observed and the bases $W_k = (W_{k1}, \ldots, W_{kp})^T$ are latent. In Section 3.1, we rely on the decompositions (5) and (6) and Proposition 1 to develop an estimator of $\Omega_0$, and in Section 3.2 discuss the selection of the tuning parameter.
3.1 Composition-Adjusted Thresholding
In view of Proposition 1, we wish to estimate the covariance parameter $\Omega_0$ via the proxy $\Gamma_0$. To this end, we first construct an empirical estimate of $\Gamma_0$ and then apply adaptive thresholding to the estimate.

There are two equivalent ways to form the estimate of $\Gamma_0$. Motivated by the decomposition (6), one can start with the sample counterpart $\widehat{T} = (\hat{\tau}_{ij})_{p \times p}$ of $T_0$ defined by
$$\hat{\tau}_{ij} = \frac{1}{n}\sum_{k=1}^n (\tau_{kij} - \bar{\tau}_{ij})^2,$$
where $\tau_{kij} = \log(X_{ki}/X_{kj})$ and $\bar{\tau}_{ij} = n^{-1}\sum_{k=1}^n \tau_{kij}$. A rank-2 component $\hat{\alpha}1^T + 1\hat{\alpha}^T$ with $\hat{\alpha} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_p)^T$ can be extracted from the decomposition (6) by projecting $\widehat{T}$ onto the subspace $\mathcal{A} \equiv \{\alpha 1^T + 1\alpha^T : \alpha \in \mathbb{R}^p\}$, which is given by
$$\hat{\alpha}_i = \hat{\tau}_{i\cdot} - \frac{1}{2}\hat{\tau}_{\cdot\cdot},$$
where $\hat{\tau}_{i\cdot} = p^{-1}\sum_{j=1}^p \hat{\tau}_{ij}$ and $\hat{\tau}_{\cdot\cdot} = p^{-2}\sum_{i,j=1}^p \hat{\tau}_{ij}$. The residual matrix $\widehat{\Gamma} = -(\widehat{T} - \hat{\alpha}1^T - 1\hat{\alpha}^T)/2$, with entries
$$\hat{\gamma}_{ij} = -\frac{1}{2}(\hat{\tau}_{ij} - \hat{\alpha}_i - \hat{\alpha}_j) = -\frac{1}{2}(\hat{\tau}_{ij} - \hat{\tau}_{i\cdot} - \hat{\tau}_{\cdot j} + \hat{\tau}_{\cdot\cdot}),$$
is then an estimate of $\Gamma_0$. Alternatively, $\widehat{\Gamma}$ can be obtained directly as the sample counterpart of $\Gamma_0$ through the expression
$$\hat{\gamma}_{ij} = \frac{1}{n}\sum_{k=1}^n (\gamma_{ki} - \bar{\gamma}_i)(\gamma_{kj} - \bar{\gamma}_j), \tag{8}$$
where $\gamma_{kj} = \log(X_{kj}/g(X_k))$ and $\bar{\gamma}_j = n^{-1}\sum_{k=1}^n \gamma_{kj}$.
Now applying adaptive thresholding to $\widehat{\Gamma}$, we define the composition-adjusted thresholding (COAT) estimator
$$\widehat{\Omega} = (\hat{\omega}_{ij})_{p \times p} \quad \text{with } \hat{\omega}_{ij} = S_{\lambda_{ij}}(\hat{\gamma}_{ij}), \tag{9}$$
where $S_\lambda(\cdot)$ is a general thresholding function and $\lambda_{ij} > 0$ are entry-dependent thresholds.

In this article, we consider a class of general thresholding functions $S_\lambda(\cdot)$ that satisfy the following conditions:

(i) $S_\lambda(z) = 0$ for $|z| \le \lambda$;

(ii) $|S_\lambda(z) - z| \le \lambda$ for all $z \in \mathbb{R}$.

These two conditions were assumed by Rothman, Levina, and Zhu (2009) and Cai and Liu (2011) along with another condition that is not required in our analysis. Examples of thresholding functions belonging to this class include the hard thresholding rule $S_\lambda(z) = zI(|z| \ge \lambda)$, the soft thresholding rule $S_\lambda(z) = \mathrm{sgn}(z)(|z| - \lambda)_+$, and the adaptive lasso rule $S_\lambda(z) = z(1 - |\lambda/z|^\eta)_+$ for $\eta \ge 1$.
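As a reference, the three example rules can be implemented as follows (a sketch, vectorized over arrays of entries and thresholds):

```python
# Example thresholding rules; each satisfies conditions (i) and (ii).
import numpy as np

def hard_threshold(z, lam):
    return z * (np.abs(z) >= lam)

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def adaptive_lasso_threshold(z, lam, eta=1.0):
    # S_lambda(z) = z (1 - |lambda/z|^eta)_+ ; equals 0 wherever |z| <= lambda
    safe_z = np.where(z == 0, np.inf, z)  # avoid 0/0; output at z = 0 is 0 anyway
    return z * np.maximum(1.0 - np.abs(lam / safe_z) ** eta, 0.0)
```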
The performance of the COAT estimator depends critically on the choice of thresholds. Using entry-adaptive thresholds may in general improve the performance over applying a universal threshold. To derive a data-driven choice of $\lambda_{ij}$, define
$$\theta_{ij} = \mathrm{Var}\{(Y_i - \mu_i)(Y_j - \mu_j)\},$$
where $\mu_j = E Y_j$. We take $\lambda_{ij}$ to be of the form
$$\lambda_{ij} = \lambda \sqrt{\hat{\theta}_{ij}}, \tag{10}$$
where $\hat{\theta}_{ij}$ are estimates of $\theta_{ij}$, and $\lambda > 0$ is a tuning parameter to be chosen, for example, by cross-validation. We rewrite (8) as $\hat{\gamma}_{ij} = n^{-1}\sum_{k=1}^n \gamma_{kij}$, where $\gamma_{kij} = (\gamma_{ki} - \bar{\gamma}_i)(\gamma_{kj} - \bar{\gamma}_j)$. Then $\theta_{ij}$ can be estimated by
$$\hat{\theta}_{ij} = \frac{1}{n}\sum_{k=1}^n (\gamma_{kij} - \hat{\gamma}_{ij})^2.$$
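Putting the pieces together, a sketch of the COAT estimator (9) with the data-driven thresholds (10), reusing clr_covariance and soft_threshold from the sketches above. Whether the diagonal is thresholded is not spelled out here; leaving it unthresholded is a common convention and an assumption on our part:

```python
# COAT: adaptive thresholding of the sample clr covariance, eqs. (9)-(10).
import numpy as np

def coat(X, lam, threshold=soft_threshold):
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)
    clr_c = clr - clr.mean(axis=0, keepdims=True)

    gamma_k = np.einsum('ki,kj->kij', clr_c, clr_c)        # gamma_kij, shape (n, p, p)
    gamma_hat = gamma_k.mean(axis=0)                       # Gamma_hat, eq. (8)
    theta_hat = ((gamma_k - gamma_hat) ** 2).mean(axis=0)  # theta_hat_ij

    omega_hat = threshold(gamma_hat, lam * np.sqrt(theta_hat))  # eqs. (9)-(10)
    np.fill_diagonal(omega_hat, np.diag(gamma_hat))  # keep diagonal unthresholded
    return omega_hat
```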
3.2 Tuning Parameter Selection
The thresholds defined by (10) depend on the tuning parameter $\lambda$, which can be chosen through $V$-fold cross-validation. Denote by $\widehat{\Omega}^{(-v)}(\lambda)$ the COAT estimate based on the training data excluding the $v$th fold, and $\widehat{\Gamma}^{(v)}$ the residual matrix (or the sample centered log-ratio covariance matrix) based on the test data including only the $v$th fold. We choose the optimal value of $\lambda$ that minimizes the cross-validation error
$$\mathrm{CV}(\lambda) = \frac{1}{V}\sum_{v=1}^V \|\widehat{\Omega}^{(-v)}(\lambda) - \widehat{\Gamma}^{(v)}\|_F^2.$$
With the optimal $\lambda$, we then compute the COAT estimate based on the full dataset as our final estimate. When the positive definiteness of the covariance estimate in finite samples is required for interpretation, we follow the approach of Fan, Liao, and Mincheva (2013) and choose $\lambda$ in the range where the minimum eigenvalue of the COAT estimate is positive.
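A sketch of the cross-validation loop, reusing coat and clr_covariance from the sketches above; the fold assignment and the grid of candidate values for $\lambda$ are illustrative choices:

```python
# V-fold cross-validation for lambda, minimizing CV(lambda) above.
import numpy as np

def coat_cv(X, lambdas, V=5, seed=0):
    n = X.shape[0]
    folds = np.random.default_rng(seed).integers(0, V, size=n)
    cv_err = np.zeros(len(lambdas))
    for v in range(V):
        train, test = X[folds != v], X[folds == v]
        gamma_test = clr_covariance(test)       # Gamma_hat^(v) on held-out fold
        for i, lam in enumerate(lambdas):
            omega_train = coat(train, lam)      # Omega_hat^(-v)(lambda)
            cv_err[i] += np.sum((omega_train - gamma_test) ** 2) / V
    best = lambdas[int(np.argmin(cv_err))]
    return coat(X, best), best                  # refit on the full dataset
```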
4 Theoretical Properties
In this section, we investigate the asymptotic properties of the COAT estimator. As a distinguishing feature of our theoretical analysis, we assume neither the exact identifiability of the parameters nor that the degree of (approximate) identifiability is dominated by the statistical error. Instead, the degree of identifiability enters our analysis and shows up in the resulting rate of convergence. Such theoretical analysis is rare in the literature, but is extremely relevant for latent variable models in the presence of nonidentifiability and is of theoretical interest in its own right. We introduce our assumptions in Section 4.1, and present our main results on rates of convergence and support recovery in Section 4.2.