Agglomerative Info-Clustering

Chung Chan, Ali Al-Bashabsheh and Qiaoqiao Zhou

C. Chan (email: [email protected]), A. Al-Bashabsheh, and Q. Zhou are with the Institute of Network Coding at the Chinese University of Hong Kong, the Shenzhen Key Laboratory of Network Coding Key Technology and Application, China, and the Shenzhen Research Institute of the Chinese University of Hong Kong. The work described in this paper was supported by a grant from the University Grants Committee of the Hong Kong Special Administrative Region, China (Project No. AoE/E-02/08), and supported partially by a grant from the Shenzhen Science and Technology Innovation Committee (JSGG20160301170514984), the Chinese University of Hong Kong (Shenzhen), China. The work of C. Chan was supported in part by The Vice-Chancellor's One-off Discretionary Fund of The Chinese University of Hong Kong (Project Nos. VCF2014030 and VCF2015007), and a grant from the University Grants Committee of the Hong Kong Special Administrative Region, China (Project No. 14200714).

Abstract—An agglomerative clustering of random variables is proposed, where clusters of random variables sharing the maximum amount of multivariate mutual information are merged successively to form larger clusters. Compared to the previous info-clustering algorithms, the agglomerative approach allows the computation to stop earlier when clusters of desired size and accuracy are obtained. An efficient algorithm is also derived based on the submodularity of entropy and the duality between the principal sequence of partitions and the principal sequence for submodular functions.

I. INTRODUCTION

We consider the info-clustering paradigm proposed in [1]. It is a hierarchical clustering of a finite set of random variables (RVs) based on the multivariate mutual information (MMI) defined in [2]. The MMI is a natural extension of Shannon's mutual information to the multivariate case involving possibly more than two RVs. It was first proposed in [3] as a measure of mutual information after identifying the divergence upper bound of the secret key agreement problem [4] to be loose in the case with helpers but tight in the no-helper case, which established an operational meaning of the MMI as the secrecy capacity [5]. The MMI was also shown to be equal to the undirected network coding throughput [6] under the matroidal undirected network link model [7].

The info-clustering solution was shown in [1] to coincide with an elegant mathematical structure called the principal sequence of partitions (PSP) [8] of a submodular function, namely, that of the entropy function [9] of the RVs to be clustered. This leads to an algorithm [1, Algorithm 3] similar to [10] that computes the clustering solution in O(m^2 SFM(m)) time, where SFM(m) is the time required to minimize a submodular function on a ground set of size m. In practice, however: 1) one may want to obtain clusters of a desired size rather than the entire hierarchy of clusters of different sizes; and 2) the entropy function needs to be estimated from data, which can be difficult for a large set of random variables [11]. These practical considerations motivate the search for an iterative info-clustering algorithm. A divisive clustering approach was proposed in [1, Algorithms 1 and 2] that breaks down the computation by splitting the entire set of RVs successively into increasingly smaller clusters. However, doing so appears to be inefficient, requiring Ω(m^3 SFM(m)) time in the worst case. Furthermore, it computes the larger clusters first, the entropy function of which is more difficult to estimate from data, and so the error may be carried forward to subsequent computations of smaller clusters.

In this work, we propose an agglomerative info-clustering approach that aims to resolve the above issues. The idea is to start with smaller clusters first and merge them to form larger clusters successively. It turns out that the algorithm can be implemented more efficiently than the divisive approach, by relating the PSP to another structure called the principal sequence (PS) [12–14]. A similar duality between the PSP and PS was also used in [15] to relate info-clustering to the problem of feature selection. The contribution of this work is the derivation of a rigorous information-theoretic interpretation useful for the clustering problem, based on which further heuristics for estimation, approximation, or model reduction as in [1] can be developed.

II. MOTIVATION

Before a rigorous and general treatment, we introduce the problem and results informally using the following example from [1, Figure 1a]: Consider the following RVs defined using the uniform and independent bits X_a, X_b, X_c and X_d:

    Z_1 := (X_a, X_d),   Z_2 := (X_a, X_d),   Z_3 := X_a,
    Z_4 := X_b,          Z_5 := X_b,          Z_6 := X_c.    (2.1)

Info-clustering [1] is a hierarchical clustering approach based on a multivariate measure of the information shared among (multiple) RVs. As will be discussed more precisely in a subsequent section, info-clustering provides clusters for different thresholds γ ∈ R, where a cluster is an inclusion-wise maximal subset of RVs that share more than γ amount of information. (We do not regard a singleton as a cluster.) Since the correlation structure of the RVs in (2.1) is simple, let us for the moment define such a measure of information among Z_B, for any B ⊆ {1,...,6} with |B| ≥ 2, as the number of bits shared by Z_B, and denote it by I(Z_B). Then we have
• I(Z_{1,...,6}) = 0 since, e.g., Z_{1,...,5} and Z_6 share no bits. (Similarly, I(Z_{B∪{6}}) = 0 for ∅ ≠ B ⊆ {1,...,5}; I(Z_{1,4}) = 0; etc.)
• I(Z_{1,2,3}) = 1 since Z_1, Z_2, and Z_3 share the bit X_a. (Similarly, I(Z_{1,3}) = 1 = I(Z_{2,3}).) We also have I(Z_{4,5}) = 1 since Z_4 and Z_5 share the bit X_b.
• I(Z_{1,2}) = 2 since Z_1 and Z_2 share the two bits X_a and X_d. (This is the only set of RVs sharing more than one bit.)
• There is no set of RVs that share more than two bits.

Hence, for all γ ∈ R, the collection of clusters at threshold γ, denoted as C_γ(Z_{1,...,6}), is given by [1, Figure 1b]

    C_γ(Z_{1,...,6}) =  {{1,...,6}},        γ < 0
                        {{1,2,3},{4,5}},    γ ∈ [0,1)
                        {{1,2}},            γ ∈ [1,2)
                        ∅,                  γ ≥ 2.        (2.2)

For instance, for any γ < 0, there is only one cluster, namely, the entire set of RVs, since any subset containing at least two RVs satisfies the threshold constraint and the entire set {1,...,6} is trivially the maximal one. For γ = 0, the threshold constraint dictates that we seek collections of RVs that share a strictly positive number of bits, i.e., one or two bits in this example. Of such sets, it is easy to verify that the maximal ones are {1,2,3} and {4,5}. This remains to be the case for any γ ∈ [0,1). For γ = 1, we seek collections of RVs that share more than one bit, i.e., two bits in this example. The set {1,2} is the only such set, and so it is trivially maximal. This remains to be the case for γ ∈ [1,2). Finally, for γ ≥ 2, we have no clusters since no set of RVs shares more than two bits of information.

The reader may have observed the following hierarchical structure: Starting with the cluster {1,...,6} at sufficiently small threshold γ, the cluster breaks into two smaller clusters {1,2,3} and {4,5}, where each is a maximal subset that has more shared information bits than the original cluster. Continuing in this fashion, the cluster {1,2,3} breaks into the smaller cluster {1,2}, at which point no further breakage is possible and all clusters have been found. More generally, under a proper choice of the multivariate information measure, the hierarchical structure of the clusters persists, thereby allowing a divisive algorithm for finding the clusters [1, Algorithm 1]. The algorithm starts with the entire set of RVs and then proceeds iteratively to break the clusters into smaller and smaller disjoint clusters.

In this work, we propose the reverse procedure for computing the clusters, namely, an agglomerative approach where RVs gradually group into larger and larger clusters. In our example, starting with the singletons at a sufficiently large value of γ, merge {1} and {2} into a cluster since such a merging results in the maximal subset {1,2} with the maximum number of shared bits. Continue to merge {1,2} with {3} and merge {4} with {5} to form the maximal subsets {1,2,3} and {4,5} with the second largest number of shared bits. Continue in the same way to merge {1,2,3}, {4,5}, and {6} into the cluster {1,...,6}, at which point no further merging is possible and all clusters have been found.

The practicality of an agglomerative approach in comparison to a divisive one is that the former identifies clusters with a larger amount of shared bits first. A larger amount of shared bits can be estimated from data more accurately, and so it serves as a stronger base for the subsequent estimation for finding clusters with a smaller amount of shared bits. Moreover, as we will subsequently show, when the information measure is chosen to be the MMI in [2], the agglomerative approach is computationally more efficient compared to the divisive one.

III. PROBLEM FORMULATION

Let Z_V := (Z_i | i ∈ V) be a random vector, where V is an ordered finite set of |V| > 1 RVs. The set of clusters at any threshold γ ∈ R is defined in [1] as

    C_γ(Z_V) := maximal{B ⊆ V | |B| > 1, I(Z_B) > γ},    (3.1)

where maximal F := {B ∈ F | ∄ B′ ∈ F, B ⊊ B′} denotes the collection of inclusion-wise maximal elements of any collection F of subsets, and I(Z_B) is a multivariate information measure satisfying

    I(Z_{B_1∪B_2}) ≥ min{I(Z_{B_1}), I(Z_{B_2})}    (3.2)

for all B_i ⊆ V with |B_i| > 1, i ∈ {1,2}, and B_1 ∩ B_2 ≠ ∅. It was shown in [1, Theorem 3] that the clusters form a laminar family, i.e., for any γ′ ≤ γ′′, C′ ∈ C_{γ′}(Z_V), and C′′ ∈ C_{γ′′}(Z_V), we have

    C′ ∩ C′′ = ∅  or  C′ ⊇ C′′.    (3.3)

In particular, clusters at the same threshold must be disjoint. Consequently, the clustering solution can be characterized as follows by set partitions of V, the collection of which is denoted as Π(V).

Proposition 3.1 ([1, Theorems 1 and 4]) The property (3.2) implies that the clustering solution (3.1) satisfies

    C_γ(Z_V) = P_ℓ \ {{i} | i ∈ V}  for γ ∈ [γ_ℓ, γ_{ℓ+1}), 0 ≤ ℓ ≤ N,

where N is a positive integer; γ_0 := −∞ and γ_{N+1} := ∞ for convenience;

    −∞ < γ_1 < ··· < γ_N < ∞    (3.4)

is a sequence of distinct critical values from R (consisting of the thresholds at which the set of clusters changes); and

    P_0 = {V} ≻ P_1 ≻ ··· ≻ P_{N−1} ≻ P_N = {{i} : i ∈ V}    (3.5)

is a sequence of increasingly finer partitions of V from Π(V), where P′ ⪰ P means that

    ∀ C ∈ P, ∃ C′ ∈ P′ : C ⊆ C′,    (3.6)

and "≻" denotes the strict inequality (i.e., the inclusion above is strict for at least one C ∈ P). ✷

With the laminar structure, a divisive clustering algorithm was given in [1] to compute γ_ℓ and P_ℓ by iteratively producing finer partitions from coarser ones, i.e., from ℓ = 1 to ℓ = N.

The work in [2] considered a particularly meaningful multivariate information measure, namely, the multivariate mutual information (MMI) defined as

    I(Z_V) := min_{P ∈ Π′(V)} I_P(Z_V),  where    (3.7a)
    I_P(Z_V) := 1/(|P|−1) [ Σ_{C∈P} H(Z_C) − H(Z_V) ],    (3.7b)

and Π′(V) := Π(V) \ {{V}} is the set of all partitions of V into at least two non-empty disjoint subsets.
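For concreteness, the following minimal Python sketch (ours, not part of [1] or [2]) evaluates the MMI (3.7) and the clusters (3.1) for the toy source (2.1) by exhaustive enumeration. The entropy of Z_B is simply the number of distinct uniform bits it contains, and the helper names bits, mmi and clusters_at are illustrative only.

# Brute-force evaluation of the MMI (3.7) and clusters (3.1) for the toy source (2.1).
# H(Z_B) is the number of distinct uniform bits in Z_B; partitions are enumerated
# exhaustively, so this only works for very small examples.
from itertools import combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}  # Z_1,...,Z_6
V = set(bits)

def H(B):                      # entropy function h(B) = H(Z_B) of (4.1), in bits
    return len(set().union(*(bits[i] for i in B))) if B else 0

def partitions(items):         # all partitions of a list into non-empty blocks
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        for k, block in enumerate(smaller):
            yield smaller[:k] + [[first] + block] + smaller[k + 1:]
        yield [[first]] + smaller

def mmi(B):                    # I(Z_B) via (3.7): minimise I_P over partitions with |P| >= 2
    B = sorted(B)
    return min((sum(H(C) for C in P) - H(B)) / (len(P) - 1)
               for P in partitions(B) if len(P) > 1)

def clusters_at(gamma):        # C_gamma(Z_V) of (3.1): maximal qualifying subsets
    good = [frozenset(B) for r in range(2, len(V) + 1)
            for B in combinations(sorted(V), r) if mmi(B) > gamma]
    return sorted(sorted(B) for B in good if not any(B < B2 for B2 in good))

print(clusters_at(-1))   # expected: [[1, 2, 3, 4, 5, 6]]
print(clusters_at(0))    # expected: [[1, 2, 3], [4, 5]]
print(clusters_at(1))    # expected: [[1, 2]]
print(clusters_at(2))    # expected: []

The printed clusters match (2.2); of course, enumerating all partitions is feasible only for toy examples, which is precisely why the polynomial-time structures of Section IV matter.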
The MMI was first proposed as a measure of mutual information in [3] and was later shown in [2] to satisfy (3.2), together with various other properties that naturally extend those of Shannon's mutual information to the multivariate case. The MMI was also later referred to in [16] as "shared information." An important result that inspired the info-clustering paradigm [1] is:

Proposition 3.2 ([2, Theorems 5.2 and 5.3]) The optimal partitions achieving the MMI in (3.7), together with the trivial partition {V}, form a lattice w.r.t. (3.6). The minimum/finest optimal partition, denoted by P*(Z_V), satisfies

    P*(Z_V) \ {{i} | i ∈ V} = C_{I(Z_V)}(Z_V),

where C_γ is defined in (3.1) using the MMI. In particular, γ_1 = I(Z_V) and P_1 = P*(Z_V) in Proposition 3.1. ✷

Together with the laminar structure in Proposition 3.1, it can be argued that any algorithm for computing P*(Z_V) can be applied iteratively for divisive info-clustering as in [1, Algorithm 2]. However, doing so appears to be less efficient compared to the alternative approach in [1, Algorithm 3] that computes all the clusters, but not in any particular order. Nevertheless, in practice it is desirable to have an iterative approach that can stop when further computation is not of interest or is meaningless due to errors in estimating or approximating the entropies from data.

[Fig. 1: Optimal partitions of: (a) I(Z_V) and (b) I(Z_{1,2,3}). In each case, the fundamental partition is the one at the bottom, where the associated clusters are circled with thick lines.]

Example 1 As an illustration of Proposition 3.2 and the divisive approach, consider our motivating example. The finest optimal partition P*(Z_V) = {{1,2,3},{4,5},{6}} is shown in Fig. 1a. For completeness, the figure also shows the lattice of optimal partitions stated in the proposition, where the trivial partition {V} is indicated using a dashed line. Similarly, Fig. 1b shows the optimal partitions of Z_{1,2,3}. Since I(Z_V) = 0 and I(Z_{1,2,3}) = 1, Proposition 3.2 asserts that C_0 = {{1,2,3},{4,5}} and C_1 = {{1,2}}, in agreement with (2.2). Assuming an algorithm for computing the finest optimal partition, the divisive algorithm starts by computing P*(Z_V) = {{1,2,3},{4,5},{6}} (from which I(Z_V) is readily available), declares the non-singleton elements as clusters at threshold I(Z_V), and proceeds iteratively by picking any cluster (of size larger than two) and computing its finest optimal partition, etc. In our example, there is only one such cluster, the set {1,2,3} at threshold 0. The finest optimal partition of Z_{1,2,3} is shown in Fig. 1b, which results in the cluster {1,2} at threshold I(Z_{1,2,3}) = 1. After this, the divisive algorithm terminates in this example. ✷

Algorithm 1: Agglomerative info-clustering.
  Data: Statistics of Z_V sufficient for calculating the entropy function h(B) for B ⊆ V := {1,...,n}.
  Result: The arrays L and PSP contain the γ_ℓ's and P_ℓ's in Proposition 3.1. More precisely, for 1 ≤ ℓ ≤ N, the entries of the arrays PSP and L are PSP[s] = P_ℓ(Z_V) and L[s] = γ_ℓ(Z_V) for s = |P_ℓ(Z_V)|, and are null otherwise.
  1  L, PSP ← empty arrays, each of length n;
  2  PSP[n] ← {{i} : i ∈ V}, s ← n;
  3  while s > 1 do
  4      (γ, P′) ← Fuse(PSP[s]);
  5      L[s] ← γ;
  6      s ← |P′|;
  7      PSP[s] ← P′;
  8  end

Instead of the divisive approach, we consider here an agglomerative approach, shown in Algorithm 1, that computes the γ_ℓ's and P_ℓ's iteratively from ℓ = N down to ℓ = 1. We will give an efficient implementation of the subroutine Fuse in Algorithm 2 that computes γ_ℓ and P_{ℓ−1} from P_ℓ for any ℓ. In particular, we will show that it suffices to compute

    I*(Z_V) := max{I(Z_B) | B ⊆ V, |B| > 1}  and    (3.8)
    C*(Z_V) := maximal{B ⊆ V | |B| > 1, I(Z_B) = I*(Z_V)},    (3.9)

which are clearly the last critical value γ_N and the non-singleton elements of the second last partition P_{N−1}, respectively.

IV. PRELIMINARIES

The ability to compute the info-clustering solution efficiently stems from the submodularity of entropy [17], or equivalently the fact that mutual information is non-negative [18]. More precisely, by denoting the entropy function of Z_V as

    h(B) := H(Z_B)  for B ⊆ V,    (4.1)

(where the dependency on Z_V is implicit for convenience), submodularity of h means that

    h(B_1) + h(B_2) ≥ h(B_1 ∪ B_2) + h(B_1 ∩ B_2)    (4.2)

for all B_1, B_2 ⊆ V. h is also said to be normalized, as h(∅) = 0, and non-decreasing, as h(B′) ≤ h(B) whenever B′ ⊆ B ⊆ V. In combinatorial optimization [19], submodularity is well known to give rise to polynomial-time solutions. The info-clustering problem, in particular, relies on the following closely related polynomial-time solvable structures.
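As a small sanity check (ours, not from the paper), the snippet below verifies numerically that the entropy function of the toy source (2.1) is normalized, non-decreasing, and submodular in the sense of (4.2); the bit-counting entropy h is the same as in the earlier sketch.

# Check that h is normalized, non-decreasing and submodular, cf. (4.1)-(4.2).
from itertools import chain, combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}
V = set(bits)

def h(B):  # entropy function (4.1), in bits
    return len(set().union(*(bits[i] for i in B))) if B else 0

subsets = [frozenset(B) for B in chain.from_iterable(combinations(sorted(V), r)
                                                     for r in range(len(V) + 1))]
assert h(frozenset()) == 0                                                    # normalized
assert all(h(B1) <= h(B2) for B1 in subsets for B2 in subsets if B1 <= B2)    # non-decreasing
assert all(h(B1) + h(B2) >= h(B1 | B2) + h(B1 & B2)                           # submodular (4.2)
           for B1 in subsets for B2 in subsets)
print("h is a normalized, non-decreasing, submodular function on", sorted(V))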
A. Principal sequence of partitions

For any real number γ ∈ R, define the residual entropy function [2] of a random vector Z_V as

    h_γ(B) := h(B) − γ  for B ⊆ V.    (4.3)

The residual entropy function is also submodular, and its Dilworth truncation evaluated at V is defined as [19]

    ĥ_γ(V) := min_{P ∈ Π(V)} h_γ[P],  where    (4.4a)
    h_γ[P] := Σ_{C∈P} h_γ(C).    (4.4b)

Proposition 4.1 ([8]) Submodularity (4.2) of h (4.1) implies that the set of optimal partitions to the Dilworth truncation (4.4a) forms a lattice, called the Dilworth truncation lattice, with respect to the partial order (3.6). Furthermore, if P′ and P′′ are the optimal partitions for γ′ and γ′′ respectively, then γ′ < γ′′ implies P′ ⪰ P′′. ✷

In particular, the minimum/finest optimal partition exists and characterizes the info-clustering solution as follows:

Proposition 4.2 ([1, Corollary 2]) For a finite set V with size |V| > 1 and a random vector Z_V,

    C_γ(Z_V) = min{P ∈ Π(V) | h_γ[P] = ĥ_γ(V)} \ {{i} | i ∈ V},    (4.5)

namely, the non-singleton elements of the finest optimal partition to the Dilworth truncation (4.4a). ✷

Hence, in Proposition 3.1, the sequence of P_ℓ for ℓ from 1 to N with the corresponding γ_ℓ also characterizes the minimum optimal partitions to (4.4a) for all γ ∈ R, and is known as the principal sequence of partitions (PSP) of h, introduced in [8].

Example 2 For the motivating example, the Dilworth truncation ĥ_γ(V) (4.4a) is shown in Fig. 2a. The Dilworth truncation is piecewise linear with at most |V| turning points. A turning point p_i := (γ_i, ĥ_{γ_i}(V)) occurs when (4.4a) has more than one solution. The collection of such optimal solutions is the Dilworth truncation lattice at γ_i indicated in Proposition 4.1. (The Dilworth truncation lattice is not shown in Fig. 2.) The finest optimal partition at γ_i remains optimal (while all other partitions cease to be) until the next turning point. In other words, the finest optimal partition at γ_i determines the line segment that follows γ_i. The sequence of the finest optimal partitions is the PSP, which is shown in Fig. 2b. ✷

[Fig. 2: Dilworth truncation. (a) ĥ_γ(V), the lower envelope of the lines h_γ[{{1,...,6}}] = 4 − γ, h_γ[{{1,2,3},{4,5},{6}}] = 4 − 3γ, h_γ[{{1,2},{3},{4},{5},{6}}] = 6 − 5γ and h_γ[{{1},...,{6}}] = 8 − 6γ, with turning points at γ = 0 = I(Z_{1,...,6}), γ = 1 = I(Z_{1,2,3}) = I(Z_{4,5}) and γ = 2 = I(Z_{1,2}). (b) The corresponding PSP.]

B. Principal sequence and minimum norm base

Consider any submodular function f : 2^U → R (4.2) on the finite ground set U. For λ ∈ R, define

    S_λ(f) := max argmin_{B ⊆ U} f(B) − λ|B|,    (4.6)

where max is the inclusion-wise maximum. Note that the minimization in (4.6) is a submodular function minimization (SFM) since the function B ⊆ U ↦ f(B) − λ|B| is also submodular. It is well known that the minimizers form a lattice with respect to set inclusion [19], and so the maximum in (4.6) exists and is unique.

Proposition 4.3 ([12, 14]) S_λ(f) for λ ∈ R satisfies

    S_{λ′}(f) ⊆ S_{λ′′}(f)  iff  λ′ ≤ λ′′,

and is referred to as the principal sequence (PS). ✷
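To make Proposition 4.3 concrete, here is a small exhaustive sketch (ours) that computes S_λ(f) of (4.6) for the entropy function of the toy source and a few values of λ, and checks that the resulting sets are nested; exhaustive enumeration stands in for a proper SFM solver, and the helper name S is illustrative only.

# Brute-force S_lambda(f) of (4.6) for f = h on the toy source, and the nesting of
# Proposition 4.3. The chosen lambdas are exact binary fractions, so the float
# comparison against the minimum value is exact here.
from itertools import chain, combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}
U = set(bits)

def h(B):
    return len(set().union(*(bits[i] for i in B))) if B else 0

def all_subsets():
    return chain.from_iterable(combinations(sorted(U), r) for r in range(len(U) + 1))

def S(lam, f=h):
    """Inclusion-wise maximum minimizer of f(B) - lam*|B|, cf. (4.6)."""
    best = min(f(B) - lam * len(B) for B in all_subsets())
    minimizers = [frozenset(B) for B in all_subsets() if f(B) - lam * len(B) == best]
    return max(minimizers, key=len)   # unique by the lattice structure of minimizers

chain_of_sets = [S(lam) for lam in (0.25, 0.5, 1.0, 2.0)]
print([sorted(B) for B in chain_of_sets])   # e.g. [[], [4, 5], [1,...,6], [1,...,6]]
assert all(a <= b for a, b in zip(chain_of_sets, chain_of_sets[1:]))   # PS nesting

For this example the chain is ∅ ⊆ {4,5} ⊆ {1,...,6}, i.e., the sets grow as λ increases, as Proposition 4.3 asserts.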
Without loss of generality, we assume f is normalized, i.e., f(∅) = 0, because we can redefine f as f − f(∅) without affecting the solutions to the SFM in (4.6), i.e., the PS is invariant to constant shifts of the submodular function. The polyhedron P(f) and base polyhedron B(f) of the submodular function f are defined as

    P(f) := {x_U ∈ R^U | x(B) ≤ f(B), ∀ B ⊆ U}    (4.7)
    B(f) := {x_U ∈ P(f) | x(U) = f(U)},    (4.8)

where x_U := (x_i | i ∈ U) and x(B) := Σ_{i∈B} x_i for convenience. (P(f) and B(f) are non-empty as f(∅) ≥ 0.) With ‖x_U‖ denoting the Euclidean norm of the vector x_U, the following holds.

Proposition 4.4 ([12, 20]) For any normalized submodular function f (4.2),

    min{‖x_U‖ | x_U ∈ B(f)}    (4.9)

has a unique solution x*_U, called the minimum (Euclidean) norm base, which satisfies

    x*_i(f) = min{λ ∈ R | i ∈ S_λ(f)}  ∀ i ∈ U,  or equivalently,    (4.10a)
    S_λ(f) = {i ∈ U | x*_i(f) ≤ λ},  ∀ λ ∈ R,    (4.10b)

where the equivalence follows directly from Proposition 4.3. ✷

The minimum norm base may be computed using Wolfe's minimum norm point algorithm as in [20], and so may the PS by (4.10b). Conversely, by (4.10a), the minimum norm base can also be computed by any SFM algorithm that solves S_λ for any λ, but the minimum norm point algorithm was shown [20] empirically to perform well compared to other submodular function minimization algorithms.

V. MAIN RESULTS

Algorithm 2: Implementation of Fuse in Algorithm 1.
  Data: P is equal to P_ℓ for some 1 ≤ ℓ ≤ N in Proposition 3.1.
  Result: (γ, P′) is equal to (γ_ℓ, P_{ℓ−1}).
  1  Enumerate P as {C_1,...,C_k} for some k > 1 and disjoint C_i's;
  2  x ← empty array of size k;
  3  for j = 1 to k do
  4      x[j] ← MinNormBase(B ↦ h(⋃_{i∈B∪{j}} C_i) − Σ_{i∈B∪{j}} h(C_i), {j+1,...,k});
  5  end
  6  γ ← − min{x[j][i] | 1 ≤ j < i ≤ k},  P′ ← ∅;
  7  for j = 1 to k do
  8      if C_j ∉ ⋃P′ then
  9          add {C_j} ∪ {C_i | i ∈ {j+1,...,k}, x[j][i] ≤ −γ} to P′;
  10     end
  11 end
  12 function MinNormBase(f, U):
  13     return an array x (indexed by U) that solves (4.9).
  14 end

The main result is the implementation of Fuse in Algorithm 2, which computes the PSP, and therefore the info-clustering solution, iteratively from the finer partitions to the coarser ones. This is done by computing the PS using a subroutine MinNormBase that computes the minimum norm base solving (4.9). An explicit implementation of this subroutine can be found in [20] using Wolfe's minimum norm point algorithm. The current abstraction also allows further approximations or simplifications for special source models as in [1].

To explain Algorithm 2, we first simplify what Fuse should compute as follows.

Theorem 5.1 Consider P_ℓ and γ_ℓ defined as in Proposition 3.1 for 1 ≤ ℓ ≤ N and write

    P_ℓ = {C_j | j ∈ U}  and  Z′_j := Z_{C_j}

for some index set U and disjoint subsets C_j for j ∈ U. Then, for 1 ≤ ℓ ≤ N we have

    γ_ℓ = max{I(Z_{∪F}) | F ⊆ P_ℓ, |F| > 1}    (5.1a)
        = I*(Z′_U),    (5.1b)

    P_{ℓ−1} \ P_ℓ = maximal{∪F | F ∈ argmax_{F ⊆ P_ℓ: |F|>1} I(Z_{∪F})}    (5.2a)
                  = {⋃_{j∈B} C_j | B ∈ C*(Z′_U)},    (5.2b)

where ∪F := ⋃_{B∈F} B for convenience. More precisely, (5.1a) and (5.2a) follow more generally from the property (3.2), while (5.1b) and (5.2b) follow from the definition (3.7) of the MMI. ✷

PROOF See Appendix A. ∎

It follows from (5.1b) and (5.2b) that it suffices to compute I* (3.8) and C* (3.9). This can be done using a minimum norm base algorithm as follows. Define

    J_T(Z_V) := I_{{{i}|i∈V}}(Z_V) = 1/(|V|−1) [ Σ_{i∈V} H(Z_i) − H(Z_V) ],    (5.3)

which is called the normalized total correlation [2], as it is the same as Watanabe's total correlation [21] except for the additional normalization factor of 1/(|V|−1).

Theorem 5.2 With the entropy function h (4.1) for Z_V, define

    U_j := {i ∈ V | i > j}  for j ∈ V, assuming V is ordered, and
    g_j(B) := h(B ∪ {j}) − Σ_{i∈B∪{j}} h({i})  for B ⊆ U_j,

which is a normalized submodular function. Then,

    I*(Z_V) = max_{C⊆V: |C|>1} J_T(Z_C)    (5.4a)
            = − min_{j∈V} min_{i∈U_j} x^{(j)}_i    (5.4b)

    C*(Z_V) = maximal argmax_{C⊆V: |C|>1} J_T(Z_C)    (5.5a)
            = maximal{ {j*} ∪ argmin_{i∈U_{j*}} x^{(j*)}_i | j* ∈ argmin_{j∈V} ( min_{i∈U_j} x^{(j)}_i ) },    (5.5b)

where x^{(j)}_{U_j} is the minimum norm base for g_j (4.9). ✷

PROOF See Appendix A. ∎
(5.4a) and (5.5a) essentially eliminate the need for minimization over partitions in calculating the MMI in (3.8) and (3.9). They serve as an intermediate step that leads to (5.4b) and (5.5b), which relate the last critical value (3.8) and the second last partition (3.9) to the minimum norm base. Together with Proposition 4.4, we have the complete implementation of Fuse as shown in Algorithm 2 using a minimum norm base algorithm.

Example 3 As an illustration of Theorem 5.2 (and the agglomerative algorithm), consider our running example with V = {1,...,6}. The minimum norm base x^{(j)}_{U_j} of g_j is given as

    x^{(1)}_{U_1} = (−2, −1, −0.5, −0.5, 0)    (entries indexed by i = 2,...,6)
    x^{(2)}_{U_2} = (−1, −0.5, −0.5, 0)        (i = 3,...,6)
    x^{(3)}_{U_3} = (−0.5, −0.5, 0)            (i = 4,...,6)
    x^{(4)}_{U_4} = (−1, 0)                    (i = 5,6)
    x^{(5)}_{U_5} = (0)                        (i = 6).

By (5.4b), we have I*(Z_V) = 2, and by (5.5b), we have C*(Z_V) = {{1,2}}. In other words, starting with the partition into singletons in the agglomerative algorithm, the function Fuse returns the threshold value 2 and the partition {{1,2},{3},{4},{5},{6}}. Let Z′_1 = Z_{1,2} and Z′_i = Z_{i+1} for i = 2,...,5. The minimum norm base x^{(j)}_{U_j} of g_j (defined using the entropy function of Z′_{1,...,5}) is given as

    x^{(1)}_{U_1} = (−1, −0.5, −0.5, 0)    (i = 2,...,5)
    x^{(2)}_{U_2} = (−0.5, −0.5, 0)        (i = 3,...,5)
    x^{(3)}_{U_3} = (−1, 0)                (i = 4,5)
    x^{(4)}_{U_4} = (0)                    (i = 5).

By (5.4b), we have I*(Z′_{1,...,5}) = 1, and by (5.5b), we have C*(Z′_{1,...,5}) = {{1,2},{3,4}}, which in terms of Z_V results in the clusters {{1,2,3},{4,5}}. In other words, when called with the input partition {{1,2},{3},{4},{5},{6}}, the function Fuse returns the threshold value 1 and the partition {{1,2,3},{4,5},{6}}. Let Z″_1 = Z_{1,2,3}, Z″_2 = Z_{4,5}, and Z″_3 = Z_6. The minimum norm base x^{(j)}_{U_j} of g_j (defined using the entropy function of Z″_{1,2,3}) is given as

    x^{(1)}_{U_1} = (0, 0)    (i = 2,3)
    x^{(2)}_{U_2} = (0)       (i = 3).

By (5.4b), we have I*(Z″_{1,2,3}) = 0, and by (5.5b), we have C*(Z″_{1,2,3}) = {{1,2,3}}, which in terms of Z_V results in the cluster {1,...,6}. In other words, when called with the input partition {{1,2,3},{4,5},{6}}, the function Fuse returns the threshold value 0 and the trivial partition {{1,...,6}}, at which point the agglomerative algorithm terminates. ✷

The complexity of the algorithm is mainly due to the computation of the minimum norm base in line 4. This computation is repeated at most |V| times. With MNP(l) being the complexity of the minimum norm base algorithm for a ground set of size l, Fuse runs in time O(|V| MNP(|V|)). Since the agglomerative info-clustering algorithm in Algorithm 1 invokes the function Fuse N − 1 ≤ |V| − 1 times, it runs in time O(|V|^2 MNP(|V|)), which is equivalent to that of [1, Algorithm 3], assuming that the submodular function minimization therein is implemented by the minimum norm base algorithm, i.e., with SFM = MNP. The divisive info-clustering algorithm in [1, Algorithm 2] makes N − 1 calls to a subroutine that calculates the fundamental partition. However, computing the fundamental partition appears to take time O(|V|^2 MNP(|V|)), which would lead to an overall complexity of O(|V|^3 MNP(|V|)) for the divisive clustering. Hence, the agglomerative info-clustering appears more efficient.
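To tie Algorithm 1, Theorem 5.1 and Example 3 together, here is an exhaustive-search stand-in (ours) for the Fuse subroutine: it computes γ_ℓ and P_{ℓ−1} directly from (5.1a)–(5.2a) using the brute-force MMI of the earlier sketch, so it is exponential-time and only serves to reproduce the toy run; an efficient implementation would instead use the minimum norm bases as in Algorithm 2.

# Brute-force reference for the contract of Fuse in Algorithm 2, via (5.1a)-(5.2a):
# gamma_l is the largest MMI over fused groups of current blocks, and the new blocks
# are the maximal optimal fusions. Running Algorithm 1's loop with it reproduces Example 3.
from itertools import combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}

def H(B):
    return len(set().union(*(bits[i] for i in B))) if B else 0

def partitions(items):
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        for k, block in enumerate(smaller):
            yield smaller[:k] + [[first] + block] + smaller[k + 1:]
        yield [[first]] + smaller

def mmi(B):
    B = sorted(B)
    return min((sum(H(C) for C in P) - H(B)) / (len(P) - 1)
               for P in partitions(B) if len(P) > 1)

def fuse(P):
    """Return (gamma, P') with gamma the best MMI over fused groups of blocks of P,
    and P' obtained by merging each maximal optimal group into one block."""
    P = [frozenset(C) for C in P]
    groups = [G for r in range(2, len(P) + 1) for G in combinations(P, r)]
    gamma = max(mmi(frozenset().union(*G)) for G in groups)
    best = [frozenset().union(*G) for G in groups
            if mmi(frozenset().union(*G)) == gamma]
    merged = [B for B in best if not any(B < D for D in best)]
    untouched = [C for C in P if not any(C <= B for B in merged)]
    return gamma, merged + untouched

P = [{i} for i in bits]                    # start from singletons, as in Algorithm 1
while len(P) > 1:
    gamma, P = fuse(P)
    print(gamma, [sorted(C) for C in P])
# expected printout:
# 2.0 [[1, 2], [3], [4], [5], [6]]
# 1.0 [[1, 2, 3], [4, 5], [6]]
# 0.0 [[1, 2, 3, 4, 5, 6]]

Its printout matches the three Fuse calls of Example 3 and the PSP of Fig. 2b.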
VI. CONCLUSION

To address the concerns of entropy estimation and computational complexity, we have proposed an agglomerative info-clustering approach that merges smaller clusters into larger clusters successively. To the best of our knowledge, this is the fastest info-clustering algorithm without any approximation. As mentioned in [1], however, faster algorithms are possible under special source models, such as the Markov tree model or the Chow–Liu tree approximation [22]. For graphical source models, the PSP can be computed more efficiently using a parametric maxflow algorithm as in [23]. The info-clustering algorithm can also be used to compute the solution of some related problems, such as the optimal discussion rate tuple for successive omniscience [24].

APPENDIX A
PROOFS OF MAIN RESULTS

PROOF (THEOREM 5.1) First, we prove (5.1a) and (5.2a) using the general property (3.2) instead of the precise definition (3.7).

We first argue that, for any feasible solution F to the r.h.s. of (5.1a),

    I(Z_{∪F}) ≤ γ_ℓ.    (a)

Suppose to the contrary that I(Z_{∪F}) > γ_ℓ. Then, there exists C ⊇ ∪F such that C ∈ C_{γ_ℓ}(Z_V) by the definition (3.1) of clusters. However, C_{γ_ℓ}(Z_V) ⊆ P_ℓ by Proposition 3.1, which contradicts the fact that F ⊆ P_ℓ with |F| > 1.

Next, we show that (a) can be achieved with equality for some feasible solution F. Consider any C ∈ P_{ℓ−1} \ P_ℓ. (Such a C exists since P_ℓ is strictly finer than P_{ℓ−1}.) Then, we have

    C = ∪F  for some F ⊆ P_ℓ with |F| > 1,    (b)

i.e., for some feasible solution F. By Proposition 3.1, we have

    C ∈ C_γ(Z_V)  for all γ ∈ [γ_{ℓ−1}, γ_ℓ),

and so I(Z_C) ≥ γ_ℓ, i.e., larger than all values in the interval. The reverse inequality also holds by (a) and (b). Hence, we have

    I(Z_C) = γ_ℓ,    (c)

which implies (5.1a) as desired.

Now, we argue that the above construction gives all the optimal solutions to the r.h.s. of (5.1a), hence establishing (5.2a). For any C ∈ P_{ℓ−1} \ P_ℓ, (b) and (c) imply that "⊆" holds for (5.2a), because the fact that C ∈ C_{γ_{ℓ−1}}(Z_V) (by Proposition 3.1) means that it is maximal by the definition (3.1) of clusters.

To argue the reverse inclusion "⊇", consider any F belonging to the r.h.s. of (5.2a). By (5.1a),

    I(Z_{∪F}) = γ_ℓ > γ_{ℓ−1},

and so ∪F ∈ C_{γ_{ℓ−1}} by the definition (3.1) of clusters and the maximality of ∪F. This completes the proof of (5.2a).

Consider proving (5.1b) and (5.2b). For any optimal solution F to (5.1a), we have

    P*(Z_{∪F}) = F = {C_j | j ∈ B}

for some B ⊆ U, where the first equality is by Proposition 3.2 since I(Z_C) > γ_ℓ for all C ∈ F such that |C| > 1. Hence,

    I(Z_{∪F}) = I_{P*(Z_{∪F})}(Z_{∪F}) = I_F(Z_{∪F})
              = I_{{{j}|j∈B}}(Z′_B) ≥ I(Z′_B),

where the inequality follows from the fact that the partition {{j} | j ∈ B} into singletons may not be the optimal partition of B for I(Z′_B). The reverse inequality also holds because, for all P′ ∈ Π′(B), defining

    P := {⋃_{j∈C′} C_j | C′ ∈ P′} ∈ Π′(∪F),

we have I_{P′}(Z′_B) = I_P(Z_{∪F}) ≥ I(Z_{∪F}). Here, the inequality follows from the fact that P may not be the optimal partition of Z_{∪F}. This completes the proof of Theorem 5.1. ∎

PROOF (THEOREM 5.2) Consider γ_N, P_{N−1}, and P_N as in Proposition 3.1 and let h_γ be as in (4.3). By Proposition 4.2, we have for all C ∈ P_{N−1} \ P_N that

    h_{γ_N}(C) = Σ_{i∈C} h_{γ_N}({i}),    (a)

or equivalently,

    γ_N = J_T(Z_C).    (b)

• The equivalence between (a) and (b) follows immediately from the definition (5.3) of J_T. More precisely, (b) is equivalent to

    γ_N = 1/(|C|−1) [ Σ_{i∈C} h({i}) − h(C) ]    by (4.1)
    ⟺ (|C|−1) γ_N = Σ_{i∈C} h({i}) − h(C)       since |C| > 1
    ⟺ h_{γ_N}(C) = Σ_{i∈C} h_{γ_N}({i})          by (4.3),

which is equivalent to (a) as desired.
• "≥" for (a) follows from

    h_{γ_N}[{C} ∪ {{i} | i ∈ V \ C}] ≥ h_{γ_N}[P_N],

since P_N is optimal to ĥ_{γ_N}(V) (4.4) by construction.
• To explain "≤" for (a), note that P_{N−1} is an optimal partition to the Dilworth truncation (4.4a) for γ ∈ [γ_{N−1}, γ_N) by Proposition 4.2. By continuity of h_γ[P_{N−1}] (4.4b) with respect to γ, we have that P_{N−1} is also optimal for γ = γ_N, i.e.,

    h_{γ_N}[P_N] = h_{γ_N}[P_{N−1}],    (c)

by the optimality of P_N. Hence, since

    Σ_{{i}∈P_N} h_{γ_N}({i}) = Σ_{C∈P_{N−1}} Σ_{i∈C} h_{γ_N}({i}),

we have

    Σ_{C∈P_{N−1}\P_N} [ h_{γ_N}(C) − Σ_{i∈C} h_{γ_N}({i}) ] = 0.

Each term in the bracket is non-negative as argued before, and so must be equal to zero.

Consider any maximal solution C′ to the r.h.s. of (5.4a). Then,

    γ_N ≤ J_T(Z_{C′}),    (d)

or equivalently,

    h_{γ_N}(C′) ≤ Σ_{i∈C′} h_{γ_N}({i}),    (e)

which follow from (b) and (a), respectively, because (5.4a) does not require C ∈ P_{N−1} \ P_N. With

    P′ := {C′} ∪ {{i} | i ∈ V \ C′},

(e) implies

    h_{γ_N}[P′] ≤ h_{γ_N}[P_N],

which must hold with equality by the optimality of P_N. Therefore, (d) also holds with equality, and so we have (5.4a) since γ_N = I*(Z_V). Note that, by (c), P_{N−1} is also an optimal partition. Furthermore, it is the largest/coarsest such partition since it is optimal for some γ < γ_N and therefore larger/coarser than all optimal partitions for γ = γ_N by Proposition 4.1. Therefore, the set of maximal optimal solutions to the r.h.s. of (5.4a) is P_{N−1} \ P_N, which is C*(Z_V) trivially, and so we have (5.5a).

To prove (5.4b) and (5.5b), let γ* = I*(Z_V). Then, it follows from (5.4a) that

    0 = min_{C⊆V: |C|>1} γ* − J_T(Z_C)
      = min_{C⊆V: |C|>1} (|C|−1)γ* + h(C) − Σ_{i∈C} h({i})    (f)
      = min_{j∈V} min_{B⊆U_j: |B|≥1} g_j(B) + γ*|B|.    (g)

• (f) is obtained by the definition (5.3) of J_T and multiplying both sides of the equality by |C| − 1 > 0, which neither violates the equality nor changes the set of solutions;
• (g) is obtained by changing the variable C to (j, B) using the bijection that sets

    j := min_{i∈C} i  and  B := C \ {j},

which is possible because |C| > 1. We have also applied |B| = |C| − 1 and the definition of g_j to rewrite the expression in the minimization.

Consider any optimal solution j* to the r.h.s. of (g). Then, we have

    S_λ(g_{j*}) ≠ ∅  for λ = −γ*,    (h)
    S_λ(g_{j*}) = ∅  for λ < −γ*.    (i)

• (h) is because, by (g),

    ∅ ≠ max argmin_{B⊆U_{j*}: |B|≥1} g_{j*}(B) + γ*|B|
      = max argmin_{B⊆U_{j*}} g_{j*}(B) + γ*|B|
      = S_{−γ*}(g_{j*}).

The last equality is by the definition (4.6) of S_{−γ*}. The first equality is because allowing B = ∅ does not change the minimum value 0, since

    g_{j*}(∅) + γ*|∅| = 0,

as g_{j*}(∅) = 0. Doing so also does not affect the maximum minimizer, since the newly introduced optimal solution, namely ∅, trivially cannot be the maximum.
• To explain (i), note that (g) implies, for all B ⊆ U_{j*} with |B| ≥ 1, that

    g_{j*}(B) + γ*|B| ≥ 0,  and so
    g_{j*}(B) − λ|B| ≥ (−λ − γ*)|B|  ∀ λ ∈ R
                     > 0              ∀ λ < −γ*.

In other words, for λ < −γ*, the empty set is the unique solution to min_{B⊆U_{j*}} g_{j*}(B) − λ|B|, which implies (i) by the definition (4.6) of S_λ.

Now, (h) and (i) imply that

    −γ* = sup{λ ∈ R | S_λ(g_{j*}) = ∅}
        = min_{i∈U_{j*}} min{λ ∈ R | i ∈ S_λ(g_{j*})}
        = min_{i∈U_{j*}} x^{(j*)}_i,    (j)

where the last equality (j) is by (4.10a), since x^{(j*)}_{U_{j*}} denotes the minimum norm base for g_{j*}. The equality implies (5.4b) as desired.

To prove (5.5b), consider any set C* that belongs to the r.h.s. of (5.5a). Applying the bijection

    j* := min_{i∈C*} i  and  B* := C* \ {j*},

(j*, B*) is a solution to the r.h.s. of (g). By the inclusion-wise maximality of C*, the set B* is also a maximal (the maximum) solution, i.e.,

    B* = S_{−γ*}(g_{j*})    (k)
       = {i ∈ U_{j*} | i ∈ S_{−γ*}(g_{j*})}
       = {i ∈ U_{j*} | x^{(j*)}_i ≤ −γ*}    (l)
       = argmin_{i∈U_{j*}} x^{(j*)}_i,    (m)

where (k) is by (4.6); (l) is by (4.10b); and (m) is by (j). Hence, C* also belongs to the r.h.s. of (5.5b).

Conversely, if (j*, B*) is an optimal solution to the r.h.s. of (g), it can be argued easily that C* := {j*} ∪ B* is also a solution to the r.h.s. of (5.4a). The maximal such C* therefore belongs to the l.h.s. of (5.5b) as desired. This completes the proof. ∎

REFERENCES

[1] C. Chan, A. Al-Bashabsheh, Q. Zhou, T. Kaced, and T. Liu, "Info-clustering: A mathematical theory for data clustering," IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 2, no. 1, pp. 64–91, June 2016.
[2] C. Chan, A. Al-Bashabsheh, J. Ebrahimi, T. Kaced, and T. Liu, "Multivariate mutual information inspired by secret-key agreement," Proceedings of the IEEE, vol. 103, no. 10, pp. 1883–1913, Oct. 2015.
[3] C. Chan, "On tightness of mutual dependence upper bound for secret-key capacity of multiple terminals," arXiv preprint arXiv:0805.3200, 2008.
[4] I. Csiszár and P. Narayan, "Secrecy capacities for multiple terminals," IEEE Trans. Inf. Theory, vol. 50, no. 12, Dec. 2004.
[5] C. Chan and L. Zheng, "Mutual dependence for secret key agreement," in Proceedings of 44th Annual Conference on Information Sciences and Systems, 2010.
[6] C. Chan, "The hidden flow of information," in Proc. IEEE Int. Symp. on Inf. Theory, St. Petersburg, Russia, Jul. 2011.
[7] ——, "Matroidal undirected network," in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, July 2012, pp. 1498–1502.
[8] H. Narayanan, "The principal lattice of partitions of a submodular function," Linear Algebra and its Applications, vol. 144, no. 0, pp. 179–216, 1990.
[9] R. W. Yeung, Information Theory and Network Coding. Springer, 2008.
[10] K. Nagano, Y. Kawahara, and S. Iwata, "Minimum average cost clustering," in NIPS, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Curran Associates, Inc., 2010, pp. 1759–1767.
[11] Y. Wu and P. Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3702–3720, 2016.
[12] S. Fujishige, "Lexicographically optimal base of a polymatroid with respect to a weight vector," Mathematics of Operations Research, vol. 5, no. 2, pp. 186–196, 1980.
[13] ——, Submodular Functions and Optimization, 2nd ed. Elsevier, 2005.
[14] ——, "Theory of principal partitions revisited," in Research Trends in Combinatorial Optimization. Springer, 2009, pp. 127–162.
[15] C. Chan, A. Al-Bashabsheh, Q. Zhou, and T. Liu, "Duality between feature selection and data clustering," in 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton Retreat Center, Monticello, Illinois, 2016.
[16] P. Narayan and H. Tyagi, "Multiterminal secrecy by public discussion," Foundations and Trends in Communications and Information Theory, vol. 13, no. 2–3, pp. 129–275, 2016.
[17] S. Fujishige, "Polymatroidal dependence structure of a set of random variables," Information and Control, vol. 39, no. 1, pp. 55–72, 1978.
[18] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July 1948.
[19] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2002.
[20] S. Fujishige and S. Isotani, "A submodular function minimization algorithm based on the minimum-norm base," Pacific Journal of Optimization, vol. 7, no. 1, pp. 3–17, 2011.
[21] S. Watanabe, "Information theoretical analysis of multivariate correlation," IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.
[22] C. Chan and T. Liu, "Clustering of random variables by multivariate mutual information on Chow-Liu tree approximations," in Fifty-Third Annual Allerton Conference on Communication, Control, and Computing, Allerton Retreat Center, Monticello, Illinois, Sep. 2015.
[23] V. Kolmogorov, "A faster algorithm for computing the principal sequence of partitions of a graph," Algorithmica, vol. 56, no. 4, pp. 394–412, 2010.
[24] C. Chan, A. Al-Bashabsheh, Q. Zhou, N. Ding, T. Liu, and A. Sprintson, "Successive omniscience," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3270–3289, June 2016.