Large Covariance Estimation for Compositional Data
via Composition-Adjusted Thresholding
Yuanpei Cao, Wei Lin, and Hongzhe Li
Abstract
High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data, and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We apply the proposed method to the analysis of a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
Key words: Adaptive thresholding; Basis covariance; Centered log-ratio covariance; High dimensionality; Microbiome; Regularization.
Yuanpei Cao is Ph.D. Candidate (E-mail: [email protected]) and Hongzhe Li is Professor (E-mail: [email protected]), Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104. Wei Lin is Assistant Professor, School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing 100871, China (E-mail: [email protected]). Cao and Li's research was supported in part by NIH grants CA127334 and GM097505. Lin's research was supported in part by NSFC grant 71532001, the Recruitment Program of Global Experts, and a start-up grant from Peking University.
1 Introduction
Compositional data, which represent the proportions or fractions of a whole, arise naturally in a wide range of applications; examples include geochemical compositions of rocks, household patterns of expenditures, species compositions of biological communities, and topic compositions of documents, among many others. This article is particularly motivated by the metagenomic analysis of microbiome data. The human microbiome is the totality of all microbes at various body sites, whose importance in human health and disease has increasingly been recognized. Recent studies have revealed that microbiome composition varies based on diet, health, and the environment (The Human Microbiome Project Consortium 2012), and may play a key role in complex diseases such as obesity, atherosclerosis, and Crohn's disease (Turnbaugh et al. 2009; Koeth et al. 2013; Lewis et al. 2015).
With the development of next-generation sequencing technologies, it is now possible to survey the microbiome composition using direct DNA sequencing of either marker genes or the whole metagenomes. After aligning these sequence reads to the reference microbial genomes, one can quantify the relative abundances of microbial taxa. These sequencing-based microbiome studies, however, provide only a relative, rather than absolute, measure of the abundances of community components. The counts comprising these data (e.g., 16S rRNA gene reads or shotgun metagenomic reads) are set by the amount of genetic material extracted from the community or the sequencing depth, and analysis typically begins by normalizing the observed data by the total number of counts. The resulting fractions thus fall into the class of high-dimensional compositional data that we focus on in this article. The high dimensionality refers to the fact that the number of taxa may be comparable to or much larger than the sample size.
An important question in metagenomic studies is to understand the co-occurrence and co-exclusion relationship between microbial taxa, which would provide valuable insights into the complex ecology of microbial communities (Faust et al. 2012). Standard correlation analysis from the raw proportions, however, can lead to spurious results due to the unit-sum constraint; the proportions tend to be correlated even if the absolute abundances are independent. Such undesired effects should be removed in an analysis in order to make valid inferences about the underlying biological processes. The compositional effects are further magnified by the low diversity of microbiome data, that is, a few taxa make up the overwhelming majority of the microbiome (Friedman and Alm 2012).
Let $X = (X_1, \ldots, X_p)^T$ be a composition of $p$ components (taxa) satisfying the simplex constraint
$$X_j > 0, \quad j = 1, \ldots, p, \qquad \sum_{j=1}^p X_j = 1.$$
Owing to the difficulties arising from the simplex constraint, it has been a long-standing question how to appropriately model, estimate, and interpret the covariance structure of compositional data. The pioneering work of Aitchison (1982, 2003) introduced several equivalent matrix specifications of compositional covariance structures via the log-ratios of components. Statistical methods based on these covariance models respect the unique features of compositional data and prove useful in a variety of applications such as geochemical analysis. A potential disadvantage of these models, however, is that they lack a direct interpretation in the usual sense of covariances and correlations; as a result, it is unclear how to impose certain structures such as sparsity in high dimensions, which is crucial for our applications to microbiome data analysis.
Covariance matrix estimation is of fundamental importance in high-dimensional data analysis and has attracted much recent interest. It is well known that the sample covariance matrix performs poorly in high dimensions and regularization is thus indispensable. Bickel and Levina (2008) and El Karoui (2008) introduced regularized estimators by hard thresholding for large covariance matrices that satisfy certain notions of sparsity. Rothman, Levina, and Zhu (2009) considered a more general class of thresholding functions, and Cai and Liu (2011) proposed adaptive thresholding that adapts to the variability of individual entries. Exploiting a factor model structure, Fan, Fan, and Lv (2008) proposed a factor-based method for high-dimensional covariance matrix estimation. Fan, Liao, and Mincheva (2013) extended the work by considering a conditional sparsity structure and developed a POET method by thresholding principal orthogonal complements.
In this article, we address the problem of covariance estimation for high-dimensional compositional data. Let $W = (W_1, \ldots, W_p)^T$ with $W_j > 0$ for all $j$ be a vector of latent variables, called the basis, that generate the observed data via the normalization
$$X_j = \frac{W_j}{\sum_{i=1}^p W_i}, \quad j = 1, \ldots, p. \tag{1}$$
Estimating the covariance structure of $W$ has traditionally been considered infeasible owing to the apparent lack of identifiability. By exploring a decomposition relating the compositional covariance to the basis covariance, we find, however, that the nonidentifiability vanishes asymptotically as the dimensionality grows under certain sparsity assumptions. More specifically, define the basis covariance matrix $\Omega_0 = (\omega_{ij}^0)_{p \times p}$ by
$$\omega_{ij}^0 = \mathrm{Cov}(Y_i, Y_j), \tag{2}$$
where $Y_j = \log W_j$. Then $\Omega_0$ is approximately identifiable as long as it belongs to a class of large sparse covariance matrices.
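To fix ideas, the following sketch simulates data from the latent-variable model (1) and (2). The log-normal basis and the banded form of $\Omega_0$ are illustrative assumptions, not part of the model; only the normalized compositions $X$ would be observed in practice.

```python
# Illustrative simulation from model (1)-(2): a hypothetical log-normal
# basis W with a sparse banded Omega_0 (both choices are assumptions made
# purely for this example), of which only the composition X is observed.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50

# Basis covariance Omega_0 = Cov(Y), Y = log W: a sparse banded example.
omega0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

Y = rng.multivariate_normal(np.zeros(p), omega0, size=n)  # Y_kj = log W_kj
W = np.exp(Y)                                             # latent basis
X = W / W.sum(axis=1, keepdims=True)                      # composition, eq. (1)

assert np.allclose(X.sum(axis=1), 1.0)  # unit-sum constraint
```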
The somewhat surprising "blessing of dimensionality" allows us to develop a simple, two-step method by first extracting a rank-2 component from the decomposition and then estimating the sparse component $\Omega_0$ by thresholding the residual matrix. The resulting procedure can equivalently be viewed as thresholding the sample centered log-ratio covariance matrix, and hence is optimization-free and scalable for large covariance matrices. We call our method composition-adjusted thresholding (COAT), which removes the "coat" of compositional effects from the covariance structure. We derive rates of convergence under the spectral norm and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We illustrate our method by analyzing a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
The covariance relationship, which was due to Aitchison (2003, sec. 4.11), has recently been exploited to develop algorithms for inferring correlation networks from metagenomic data (Friedman and Alm 2012; Fang et al. 2015; Ban, An, and Jiang 2015). Our contributions here are to turn the idea into a principled approach to sparse covariance matrix estimation and provide statistical insights into the issue of identifiability and the impacts of dimensionality. Our method also bears some resemblance to the POET method proposed by Fan, Liao, and Mincheva (2013) in that underlying both methods is a low-rank plus sparse matrix decomposition. The rank-2 component in our method, however, arises from the covariance structure of compositional data rather than a factor model assumption. As a result, it can be obtained by simple algebraic operations without computing the principal components.
The rest of the article is organized as follows. Section 2 reviews a covariance relationship and addresses the issue of identifiability. Section 3 introduces the COAT methodology. Section 4 investigates the theoretical properties of the COAT estimator in terms of convergence rates and support recovery. Simulation studies and an application to human gut microbiome data are presented in Sections 5 and 6, respectively. We conclude the article with some discussion in Section 7 and relegate all proofs to the Appendix.
2 Identifiability of the Covariance Model
We first introduce some notation. Denote by $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_F$, and $\|\cdot\|_{\max}$ the matrix $L_1$-norm, spectral norm, Frobenius norm, and entrywise $L_\infty$-norm, defined for a matrix $A = (a_{ij})$ by
$$\|A\|_1 = \max_j \sum_i |a_{ij}|, \quad \|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}, \quad \|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}, \quad \text{and} \quad \|A\|_{\max} = \max_{i,j} |a_{ij}|,$$
where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue.
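For reference, these norms map directly onto standard numerical routines; a minimal numpy sketch (illustrative only):

```python
# The four matrix norms defined above, computed with numpy.
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])

norm_l1 = np.abs(A).sum(axis=0).max()      # matrix L1-norm: maximum column sum
norm_spectral = np.linalg.norm(A, ord=2)   # spectral norm: sqrt of lambda_max(A^T A)
norm_frobenius = np.linalg.norm(A, 'fro')  # Frobenius norm
norm_max = np.abs(A).max()                 # entrywise L_infinity-norm
```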
In the latent variable covariance model (1) and (2), the basis covariance matrix $\Omega_0$ is the parameter of interest. One of the matrix specifications of compositional covariance structures introduced by Aitchison (2003) is the variation matrix $T_0 = (\tau_{ij}^0)_{p \times p}$ defined by
$$\tau_{ij}^0 = \mathrm{Var}(\log(X_i/X_j)). \tag{3}$$
In view of the relationship (1), we can decompose $\tau_{ij}^0$ as
$$\tau_{ij}^0 = \mathrm{Var}(\log W_i - \log W_j) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_j) - 2\,\mathrm{Cov}(Y_i, Y_j) = \omega_{ii}^0 + \omega_{jj}^0 - 2\omega_{ij}^0, \tag{4}$$
or in matrix form,
$$T_0 = \omega_0 1^T + 1\omega_0^T - 2\Omega_0, \tag{5}$$
where $\omega_0 = (\omega_{11}^0, \ldots, \omega_{pp}^0)^T$ and $1 = (1, \ldots, 1)^T$. Corresponding to the many-to-one relationship between bases and compositions, the basis covariance matrix $\Omega_0$ is unidentifiable from the decomposition (5), since $\omega_0 1^T + 1\omega_0^T$ and $\Omega_0$ are in general not orthogonal to each other (with respect to the usual Euclidean inner product). In fact, using the centered log-ratio covariance matrix $\Gamma_0 = (\gamma_{ij}^0)_{p \times p}$ defined by
$$\gamma_{ij}^0 = \mathrm{Cov}\{\log(X_i/g(X)), \log(X_j/g(X))\},$$
where $g(x) = (\prod_{j=1}^p x_j)^{1/p}$ is the geometric mean of a vector $x = (x_1, \ldots, x_p)^T$, we can similarly write
$$\tau_{ij}^0 = \mathrm{Var}\{\log(X_i/g(X)) - \log(X_j/g(X))\} = \mathrm{Var}\{\log(X_i/g(X))\} + \mathrm{Var}\{\log(X_j/g(X))\} - 2\,\mathrm{Cov}\{\log(X_i/g(X)), \log(X_j/g(X))\} = \gamma_{ii}^0 + \gamma_{jj}^0 - 2\gamma_{ij}^0,$$
or in matrix form,
$$T_0 = \gamma_0 1^T + 1\gamma_0^T - 2\Gamma_0, \tag{6}$$
where $\gamma_0 = (\gamma_{11}^0, \ldots, \gamma_{pp}^0)^T$. Unlike (5), the following proposition shows that (6) is an orthogonal decomposition and hence the components $\gamma_0 1^T + 1\gamma_0^T$ and $\Gamma_0$ are identifiable. In addition, by comparing the decompositions (5) and (6), we can bound the difference between $\Omega_0$ and its identifiable counterpart $\Gamma_0$ as follows.
Proposition 1. The components $\gamma_0 1^T + 1\gamma_0^T$ and $\Gamma_0$ in the decomposition (6) are orthogonal to each other. Moreover, for the covariance parameters $\Omega_0$ and $\Gamma_0$ in the decompositions (5) and (6),
$$\|\Omega_0 - \Gamma_0\|_{\max} \le 3p^{-1}\|\Omega_0\|_1.$$
Proposition 1 entails that the covariance parameter $\Omega_0$ is approximately identifiable as long as $\|\Omega_0\|_1 = o(p)$. In particular, suppose that $\Omega_0$ belongs to a class of sparse covariance matrices considered by Bickel and Levina (2008),
$$\mathcal{U}(q, s_0(p), M) \equiv \left\{\Omega : \Omega \succ 0,\ \max_j \omega_{jj} \le M,\ \max_i \sum_{j=1}^p |\omega_{ij}|^q \le s_0(p)\right\}, \tag{7}$$
where $0 \le q < 1$ and $\Omega \succ 0$ denotes that $\Omega$ is positive definite. Then
$$\|\Omega_0\|_1 = \max_i \sum_{j=1}^p |\omega_{ij}^0|^{1-q} |\omega_{ij}^0|^q \le \max_i \sum_{j=1}^p (\omega_{ii}^0 \omega_{jj}^0)^{(1-q)/2} |\omega_{ij}^0|^q \le M^{1-q} s_0(p),$$
and hence the parameters $\Omega_0$ and $\Gamma_0$ are asymptotically indistinguishable when $s_0(p) = o(p)$. This allows us to use $\Gamma_0$ as a proxy for $\Omega_0$ and greatly facilitates the development of new methodology and associated theory. The intuition behind the approximate identifiability under the sparsity assumption is that the rank-2 component $\omega_0 1^T + 1\omega_0^T$ represents a global effect that spreads across all rows and columns, while the sparse component $\Omega_0$ represents a local effect that is confined to individual entries.
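Since the centered log-ratio vector satisfies $\log(X_j/g(X)) = Y_j - \bar{Y}$ under (1), we have $\Gamma_0 = G\Omega_0 G$ with $G = I - p^{-1}11^T$, so the bound in Proposition 1 can be checked numerically. A minimal sketch (the banded $\Omega_0$ is again an illustrative assumption):

```python
# Numerical check of the Proposition 1 bound, using the identity
# Gamma_0 = G Omega_0 G with G = I - (1/p) 11^T, which follows from
# clr(X) = (I - (1/p) 11^T) log W under model (1).
import numpy as np

p = 200
omega0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))  # sparse example
G = np.eye(p) - np.ones((p, p)) / p
gamma0 = G @ omega0 @ G

lhs = np.abs(omega0 - gamma0).max()               # ||Omega_0 - Gamma_0||_max
rhs = 3.0 * np.abs(omega0).sum(axis=0).max() / p  # 3 p^{-1} ||Omega_0||_1
assert lhs <= rhs  # the bound holds; here lhs is of order 1/p
```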
Also of interest is the exact identifiability of $\Omega_0$ over $L_0$-balls, which has been studied by Fang et al. (2015) and Ban, An, and Jiang (2015). The following result provides a sufficient and necessary condition for the exact identifiability of $\Omega_0$ by confining it to an $L_0$-ball.

Proposition 2. Suppose that $\Omega_0$ belongs to the $L_0$-ball
$$\mathcal{B}(s_e(p)) \equiv \left\{\Omega : \sum_{(i,j): i<j} I(\omega_{ij} \ne 0) \le s_e(p)\right\},$$
where $p \ge 5$. Then there exist no two values of $\Omega_0$ that correspond to the same $T_0$ in (5) if and only if $s_e(p) < (p-1)/2$.

A counterexample is provided in the proof of Proposition 2 to show that the sparsity conditions in Fang et al. (2015) and Ban, An, and Jiang (2015), which are both of the order $O(p^2)$, do not suffice. The identifiability condition in Proposition 2 essentially requires the average degree of the correlation network to be less than 1, which is too restrictive to be useful in practice. This illustrates the importance and necessity of introducing the notion of approximate identifiability.
3 A Sparse Covariance Estimator for Compositional Data
Suppose that $(W_k, X_k)$, $k = 1, \ldots, n$, are independent copies of $(W, X)$, where the compositions $X_k = (X_{k1}, \ldots, X_{kp})^T$ are observed and the bases $W_k = (W_{k1}, \ldots, W_{kp})^T$ are latent. In Section 3.1, we rely on the decompositions (5) and (6) and Proposition 1 to develop an estimator of $\Omega_0$, and in Section 3.2 discuss the selection of the tuning parameter.
3.1 Composition-Adjusted Thresholding
In view of Proposition 1, we wish to estimate the covariance parameter $\Omega_0$ via the proxy $\Gamma_0$. To this end, we first construct an empirical estimate of $\Gamma_0$ and then apply adaptive thresholding to the estimate.

There are two equivalent ways to form the estimate of $\Gamma_0$. Motivated by the decomposition (6), one can start with the sample counterpart $\widehat{T} = (\hat{\tau}_{ij})_{p \times p}$ of $T_0$ defined by
$$\hat{\tau}_{ij} = \frac{1}{n}\sum_{k=1}^n (\tau_{kij} - \bar{\tau}_{ij})^2,$$
where $\tau_{kij} = \log(X_{ki}/X_{kj})$ and $\bar{\tau}_{ij} = n^{-1}\sum_{k=1}^n \tau_{kij}$. A rank-2 component $\hat{\alpha}1^T + 1\hat{\alpha}^T$ with $\hat{\alpha} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_p)^T$ can be extracted from the decomposition (6) by projecting $\widehat{T}$ onto the subspace $\mathcal{A} \equiv \{\alpha 1^T + 1\alpha^T : \alpha \in \mathbb{R}^p\}$, which is given by
$$\hat{\alpha}_i = \hat{\tau}_{i\cdot} - \frac{1}{2}\hat{\tau}_{\cdot\cdot},$$
where $\hat{\tau}_{i\cdot} = p^{-1}\sum_{j=1}^p \hat{\tau}_{ij}$ and $\hat{\tau}_{\cdot\cdot} = p^{-2}\sum_{i,j=1}^p \hat{\tau}_{ij}$. The residual matrix $\widehat{\Gamma} = -(\widehat{T} - \hat{\alpha}1^T - 1\hat{\alpha}^T)/2$, with entries
$$\hat{\gamma}_{ij} = -\frac{1}{2}(\hat{\tau}_{ij} - \hat{\alpha}_i - \hat{\alpha}_j) = -\frac{1}{2}(\hat{\tau}_{ij} - \hat{\tau}_{i\cdot} - \hat{\tau}_{\cdot j} + \hat{\tau}_{\cdot\cdot}),$$
is then an estimate of $\Gamma_0$. Alternatively, $\widehat{\Gamma}$ can be obtained directly as the sample counterpart of $\Gamma_0$ through the expression
$$\hat{\gamma}_{ij} = \frac{1}{n}\sum_{k=1}^n (\gamma_{ki} - \bar{\gamma}_i)(\gamma_{kj} - \bar{\gamma}_j), \tag{8}$$
where $\gamma_{kj} = \log(X_{kj}/g(X_k))$ and $\bar{\gamma}_j = n^{-1}\sum_{k=1}^n \gamma_{kj}$.
Now applying adaptive thresholding to $\widehat{\Gamma}$, we define the composition-adjusted thresholding (COAT) estimator
$$\widehat{\Omega} = (\hat{\omega}_{ij})_{p \times p} \quad \text{with } \hat{\omega}_{ij} = S_{\lambda_{ij}}(\hat{\gamma}_{ij}), \tag{9}$$
where $S_\lambda(\cdot)$ is a general thresholding function and $\lambda_{ij} > 0$ are entry-dependent thresholds.

In this article, we consider a class of general thresholding functions $S_\lambda(\cdot)$ that satisfy the following conditions:

(i) $S_\lambda(z) = 0$ for $|z| \le \lambda$;

(ii) $|S_\lambda(z) - z| \le \lambda$ for all $z \in \mathbb{R}$.

These two conditions were assumed by Rothman, Levina, and Zhu (2009) and Cai and Liu (2011) along with another condition that is not required in our analysis. Examples of thresholding functions belonging to this class include the hard thresholding rule $S_\lambda(z) = zI(|z| \ge \lambda)$, the soft thresholding rule $S_\lambda(z) = \mathrm{sgn}(z)(|z| - \lambda)_+$, and the adaptive lasso rule $S_\lambda(z) = z(1 - |\lambda/z|^\eta)_+$ for $\eta \ge 1$.
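As a reference, the three example rules can be implemented as follows (a sketch, vectorized over arrays of entries and thresholds):

```python
# Example thresholding rules; each satisfies conditions (i) and (ii).
import numpy as np

def hard_threshold(z, lam):
    return z * (np.abs(z) >= lam)

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def adaptive_lasso_threshold(z, lam, eta=1.0):
    # S_lambda(z) = z (1 - |lambda/z|^eta)_+ ; equals 0 wherever |z| <= lambda
    safe_z = np.where(z == 0, np.inf, z)  # avoid 0/0; output at z = 0 is 0 anyway
    return z * np.maximum(1.0 - np.abs(lam / safe_z) ** eta, 0.0)
```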
The performance of the COAT estimator depends critically on the choice of thresholds. Using entry-adaptive thresholds may in general improve the performance over applying a universal threshold. To derive a data-driven choice of $\lambda_{ij}$, define
$$\theta_{ij} = \mathrm{Var}\{(Y_i - \mu_i)(Y_j - \mu_j)\},$$
where $\mu_j = E Y_j$. We take $\lambda_{ij}$ to be of the form
$$\lambda_{ij} = \lambda \sqrt{\hat{\theta}_{ij}}, \tag{10}$$
where $\hat{\theta}_{ij}$ are estimates of $\theta_{ij}$, and $\lambda > 0$ is a tuning parameter to be chosen, for example, by cross-validation. We rewrite (8) as $\hat{\gamma}_{ij} = n^{-1}\sum_{k=1}^n \gamma_{kij}$, where $\gamma_{kij} = (\gamma_{ki} - \bar{\gamma}_i)(\gamma_{kj} - \bar{\gamma}_j)$. Then $\theta_{ij}$ can be estimated by
$$\hat{\theta}_{ij} = \frac{1}{n}\sum_{k=1}^n (\gamma_{kij} - \hat{\gamma}_{ij})^2.$$
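Putting the pieces together, a sketch of the COAT estimator (9) with the data-driven thresholds (10), reusing clr_covariance and soft_threshold from the sketches above. Whether the diagonal is thresholded is not spelled out here; leaving it unthresholded is a common convention and an assumption on our part:

```python
# COAT: adaptive thresholding of the sample clr covariance, eqs. (9)-(10).
import numpy as np

def coat(X, lam, threshold=soft_threshold):
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)
    clr_c = clr - clr.mean(axis=0, keepdims=True)

    gamma_k = np.einsum('ki,kj->kij', clr_c, clr_c)        # gamma_kij, shape (n, p, p)
    gamma_hat = gamma_k.mean(axis=0)                       # Gamma_hat, eq. (8)
    theta_hat = ((gamma_k - gamma_hat) ** 2).mean(axis=0)  # theta_hat_ij

    omega_hat = threshold(gamma_hat, lam * np.sqrt(theta_hat))  # eqs. (9)-(10)
    np.fill_diagonal(omega_hat, np.diag(gamma_hat))  # keep diagonal unthresholded
    return omega_hat
```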
3.2 Tuning Parameter Selection
The thresholds defined by (10) depend on the tuning parameter $\lambda$, which can be chosen through $V$-fold cross-validation. Denote by $\widehat{\Omega}^{(-v)}(\lambda)$ the COAT estimate based on the training data excluding the $v$th fold, and $\widehat{\Gamma}^{(v)}$ the residual matrix (or the sample centered log-ratio covariance matrix) based on the test data including only the $v$th fold. We choose the optimal value of $\lambda$ that minimizes the cross-validation error
$$\mathrm{CV}(\lambda) = \frac{1}{V}\sum_{v=1}^V \|\widehat{\Omega}^{(-v)}(\lambda) - \widehat{\Gamma}^{(v)}\|_F^2.$$
With the optimal $\lambda$, we then compute the COAT estimate based on the full dataset as our final estimate. When the positive definiteness of the covariance estimate in finite samples is required for interpretation, we follow the approach of Fan, Liao, and Mincheva (2013) and choose $\lambda$ in the range where the minimum eigenvalue of the COAT estimate is positive.
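A sketch of the cross-validation loop, reusing coat and clr_covariance from the sketches above; the fold assignment and the grid of candidate values for $\lambda$ are illustrative choices:

```python
# V-fold cross-validation for lambda, minimizing CV(lambda) above.
import numpy as np

def coat_cv(X, lambdas, V=5, seed=0):
    n = X.shape[0]
    folds = np.random.default_rng(seed).integers(0, V, size=n)
    cv_err = np.zeros(len(lambdas))
    for v in range(V):
        train, test = X[folds != v], X[folds == v]
        gamma_test = clr_covariance(test)       # Gamma_hat^(v) on held-out fold
        for i, lam in enumerate(lambdas):
            omega_train = coat(train, lam)      # Omega_hat^(-v)(lambda)
            cv_err[i] += np.sum((omega_train - gamma_test) ** 2) / V
    best = lambdas[int(np.argmin(cv_err))]
    return coat(X, best), best                  # refit on the full dataset
```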
4 Theoretical Properties
In this section, we investigate the asymptotic properties of the COAT estimator. As a distinguishing feature of our theoretical analysis, we assume neither the exact identifiability of the parameters nor that the degree of (approximate) identifiability is dominated by the statistical error. Instead, the degree of identifiability enters our analysis and shows up in the resulting rate of convergence. Such theoretical analysis is rare in the literature, but is extremely relevant for latent variable models in the presence of nonidentifiability and is of theoretical interest in its own right. We introduce our assumptions in Section 4.1, and present our main results on rates of convergence and support recovery in Section 4.2.