Agglomerative Info-Clustering

Chung Chan, Ali Al-Bashabsheh and Qiaoqiao Zhou

C. Chan (email: [email protected]), A. Al-Bashabsheh, and Q. Zhou are with the Institute of Network Coding at the Chinese University of Hong Kong, the Shenzhen Key Laboratory of Network Coding Key Technology and Application, China, and the Shenzhen Research Institute of the Chinese University of Hong Kong. The work described in this paper was supported by a grant from the University Grants Committee of the Hong Kong Special Administrative Region, China (Project No. AoE/E-02/08), and supported partially by a grant from the Shenzhen Science and Technology Innovation Committee (JSGG20160301170514984), the Chinese University of Hong Kong (Shenzhen), China. The work of C. Chan was supported in part by The Vice-Chancellor's One-off Discretionary Fund of The Chinese University of Hong Kong (Project Nos. VCF2014030 and VCF2015007), and a grant from the University Grants Committee of the Hong Kong Special Administrative Region, China (Project No. 14200714).

Abstract—An agglomerative clustering of random variables is proposed, where clusters of random variables sharing the maximum amount of multivariate mutual information are merged successively to form larger clusters. Compared to the previous info-clustering algorithms, the agglomerative approach allows the computation to stop earlier when clusters of desired size and accuracy are obtained. An efficient algorithm is also derived based on the submodularity of entropy and the duality between the principal sequence of partitions and the principal sequence for submodular functions.

I. INTRODUCTION

We consider the info-clustering paradigm proposed in [1]. It is a hierarchical clustering of a finite set of random variables (RVs) based on the multivariate mutual information (MMI) defined in [2]. The MMI is a natural extension of Shannon's mutual information to the multivariate case involving possibly more than two RVs. It was first proposed in [3] as a measure of mutual information after identifying the divergence upper bound of the secret key agreement problem [4] to be loose in the case with helpers but tight in the no-helper case, which established an operational meaning of the MMI as the secrecy capacity [5]. The MMI was also shown to be equal to the undirected network coding throughput [6] under the matroidal undirected network link model [7].

The info-clustering solution was shown in [1] to coincide with an elegant mathematical structure called the principal sequence of partitions (PSP) [8] of a submodular function, namely, that of the entropy function [9] of the RVs to be clustered. This leads to an algorithm [1, Algorithm 3] similar to [10] that computes the clustering solution in O(m^2 SFM(m)) time, where SFM(m) is the time required to minimize a submodular function on a ground set of size m. In practice, however: 1) one may want to obtain clusters of a desired size rather than the entire hierarchy of clusters of different sizes; and 2) the entropy function needs to be estimated from data, which can be difficult for a large set of random variables [11]. These practical considerations motivate the search for an iterative info-clustering algorithm. A divisive clustering approach was proposed in [1, Algorithms 1 and 2] that breaks down the computation by splitting the entire set of RVs successively into increasingly smaller clusters. However, doing so appears to be inefficient, requiring Ω(m^3 SFM(m)) time in the worst case. Furthermore, it computes the larger clusters first, the entropy function of which is more difficult to estimate from data, and so the error may be carried forward to subsequent computations of smaller clusters.

In this work, we propose an agglomerative info-clustering approach that aims to resolve the above issues. The idea is to start with smaller clusters first and merge them to form larger clusters successively. It turns out that the algorithm can be implemented more efficiently than the divisive approach, by relating the PSP to another structure called the principal sequence (PS) [12–14]. A similar duality between the PSP and PS was also used in [15] to relate info-clustering to the problem of feature selection. The contribution of this work is the derivation of a rigorous information-theoretic interpretation useful for the clustering problem, based on which further heuristics for estimation, approximation, or model reduction as in [1] can be developed.

II. MOTIVATION

Before a rigorous and general treatment, we introduce the problem and results informally using the following example from [1, Figure 1a]: Consider the following RVs defined using the uniform and independent bits X_a, X_b, X_c and X_d:

    Z_1 := (X_a, X_d),   Z_2 := (X_a, X_d),   Z_3 := X_a,
    Z_4 := X_b,          Z_5 := X_b,          Z_6 := X_c.    (2.1)

Info-clustering [1] is a hierarchical clustering approach based on a multivariate measure of the information shared among (multiple) RVs. As will be discussed more precisely in a subsequent section, info-clustering provides clusters for different thresholds γ ∈ R, where a cluster is an inclusion-wise maximal subset of RVs that share more than γ amount of information. (We do not regard a singleton as a cluster.) Since the correlation structure of the RVs in (2.1) is simple, let us for the moment define such a measure of information among Z_B, for any B ⊆ {1,...,6} with |B| ≥ 2, as the number of bits shared by Z_B, and denote it by I(Z_B). Then we have
• I(Z_{1,...,6}) = 0 since, e.g., Z_{1,...,5} and Z_6 share no bits. (Similarly, I(Z_{B∪{6}}) = 0 for ∅ ≠ B ⊆ {1,...,5}; I(Z_{1,4}) = 0; etc.)
• I(Z_{1,2,3}) = 1 since Z_1, Z_2, and Z_3 share the bit X_a. (Similarly, I(Z_{1,3}) = 1 = I(Z_{2,3}).) We also have I(Z_{4,5}) = 1 since Z_4 and Z_5 share the bit X_b.
• I(Z_{1,2}) = 2 since Z_1 and Z_2 share the two bits X_a and X_d. (This is the only set of RVs sharing more than one bit.)
• There is no set of RVs that share more than two bits.

Hence, for all γ ∈ R, the collection of clusters at threshold γ, denoted as C_γ(Z_{1,...,6}), is given by [1, Figure 1b]

    C_γ(Z_{1,...,6}) =  {{1,...,6}},        γ < 0
                        {{1,2,3},{4,5}},    γ ∈ [0,1)
                        {{1,2}},            γ ∈ [1,2)
                        ∅,                  γ ≥ 2.        (2.2)

For instance, for any γ < 0, there is only one cluster, namely, the entire set of RVs, since any subset containing at least two RVs satisfies the threshold constraint and the entire set {1,...,6} is trivially the maximal one. For γ = 0, the threshold constraint dictates that we seek collections of RVs that share a strictly positive number of bits, i.e., one or two bits in this example. Of such sets, it is easy to verify that the maximal ones are {1,2,3} and {4,5}. This remains to be the case for any γ ∈ [0,1). For γ = 1, we seek collections of RVs that share more than one bit, i.e., two bits in this example. The set {1,2} is the only such set, and so it is trivially maximal. This remains to be the case for γ ∈ [1,2). Finally, for γ ≥ 2, we have no clusters since no set of RVs shares more than two bits of information.

The reader may have observed the following hierarchical structure: Starting with the cluster {1,...,6} at sufficiently small threshold γ, the cluster breaks into two smaller clusters {1,2,3} and {4,5}, where each is a maximal subset that has more shared information bits than the original cluster. Continuing in this fashion, the cluster {1,2,3} breaks into the smaller cluster {1,2}, at which point no further breakage is possible and all clusters have been found. More generally, under a proper choice of the multivariate information measure, the hierarchical structure of the clusters persists, thereby allowing a divisive algorithm for finding the clusters [1, Algorithm 1]. The algorithm starts with the entire set of RVs and then proceeds iteratively to break the clusters into smaller and smaller disjoint clusters.

In this work, we propose the reverse procedure for computing the clusters, namely, an agglomerative approach where RVs gradually group into larger and larger clusters. In our example, starting with the singletons at a sufficiently large value of γ, merge {1} and {2} into a cluster since such a merging results in the maximal subset {1,2} with the maximum number of shared bits. Continue to merge {1,2} with {3} and merge {4} with {5} to form the maximal subsets {1,2,3} and {4,5} with the second largest number of shared bits. Continue in the same way to merge {1,2,3}, {4,5}, and {6} into the cluster {1,...,6}, at which point no further merging is possible and all clusters have been found.

The practicality of an agglomerative approach in comparison to a divisive one is that the former identifies clusters with a larger amount of shared bits first. A larger amount of shared bits can be estimated from data more accurately, and so it serves as a stronger base for the subsequent estimation for finding clusters with a smaller amount of shared bits. Moreover, as we will subsequently show, when the information measure is chosen to be the MMI in [2], the agglomerative approach is computationally more efficient compared to the divisive one.

III. PROBLEM FORMULATION

Let Z_V := (Z_i | i ∈ V) be a random vector, where V is an ordered finite set of |V| > 1 RVs. The set of clusters at any threshold γ ∈ R is defined in [1] as

    C_γ(Z_V) := maximal{B ⊆ V | |B| > 1, I(Z_B) > γ},    (3.1)

where maximal F := {B ∈ F | ∄ B′ ∈ F, B ⊊ B′} denotes the collection of inclusion-wise maximal elements of any collection F of subsets, and I(Z_B) is a multivariate information measure satisfying

    I(Z_{B_1∪B_2}) ≥ min{I(Z_{B_1}), I(Z_{B_2})}    (3.2)

for all B_i ⊆ V with |B_i| > 1, i ∈ {1,2}, and B_1 ∩ B_2 ≠ ∅. It was shown in [1, Theorem 3] that the clusters form a laminar family, i.e., for any γ′ ≤ γ′′, C′ ∈ C_{γ′}(Z_V), and C′′ ∈ C_{γ′′}(Z_V), we have

    C′ ∩ C′′ = ∅  or  C′ ⊇ C′′.    (3.3)

In particular, clusters at the same threshold must be disjoint. Consequently, the clustering solution can be characterized as follows by set partitions of V, the collection of which is denoted as Π(V).

Proposition 3.1 ([1, Theorems 1 and 4]) The property (3.2) implies that the clustering solution (3.1) satisfies

    C_γ(Z_V) = P_ℓ \ {{i} | i ∈ V}  for γ ∈ [γ_ℓ, γ_{ℓ+1}), 0 ≤ ℓ ≤ N,

where N is a positive integer; γ_0 := −∞ and γ_{N+1} := ∞ for convenience;

    −∞ < γ_1 < ··· < γ_N < ∞    (3.4)

is a sequence of distinct critical values from R (consisting of the thresholds at which the set of clusters changes); and

    P_0 = {V} ≻ P_1 ≻ ··· ≻ P_{N−1} ≻ P_N = {{i} : i ∈ V}    (3.5)

is a sequence of increasingly finer partitions of V from Π(V), where P′ ⪰ P means that

    ∀ C ∈ P, ∃ C′ ∈ P′ : C ⊆ C′,    (3.6)

and "≻" denotes the strict inequality (i.e., the inclusion above is strict for at least one C ∈ P). ✷

With the laminar structure, a divisive clustering algorithm was given in [1] to compute γ_ℓ and P_ℓ by iteratively producing finer partitions from coarser ones, i.e., from ℓ = 1 to ℓ = N.

The work in [2] considered a particularly meaningful multivariate information measure, namely, the multivariate mutual information (MMI) defined as

    I(Z_V) := min_{P ∈ Π′(V)} I_P(Z_V),  where    (3.7a)
    I_P(Z_V) := 1/(|P|−1) [ Σ_{C∈P} H(Z_C) − H(Z_V) ],    (3.7b)

and Π′(V) := Π(V) \ {{V}} is the set of all partitions of V into at least two non-empty disjoint subsets.
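For concreteness, the following minimal Python sketch (ours, not part of [1] or [2]) evaluates the MMI (3.7) and the clusters (3.1) for the toy source (2.1) by exhaustive enumeration. The entropy of Z_B is simply the number of distinct uniform bits it contains, and the helper names bits, mmi and clusters_at are illustrative only.

# Brute-force evaluation of the MMI (3.7) and clusters (3.1) for the toy source (2.1).
# H(Z_B) is the number of distinct uniform bits in Z_B; partitions are enumerated
# exhaustively, so this only works for very small examples.
from itertools import combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}  # Z_1,...,Z_6
V = set(bits)

def H(B):                      # entropy function h(B) = H(Z_B) of (4.1), in bits
    return len(set().union(*(bits[i] for i in B))) if B else 0

def partitions(items):         # all partitions of a list into non-empty blocks
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        for k, block in enumerate(smaller):
            yield smaller[:k] + [[first] + block] + smaller[k + 1:]
        yield [[first]] + smaller

def mmi(B):                    # I(Z_B) via (3.7): minimise I_P over partitions with |P| >= 2
    B = sorted(B)
    return min((sum(H(C) for C in P) - H(B)) / (len(P) - 1)
               for P in partitions(B) if len(P) > 1)

def clusters_at(gamma):        # C_gamma(Z_V) of (3.1): maximal qualifying subsets
    good = [frozenset(B) for r in range(2, len(V) + 1)
            for B in combinations(sorted(V), r) if mmi(B) > gamma]
    return sorted(sorted(B) for B in good if not any(B < B2 for B2 in good))

print(clusters_at(-1))   # expected: [[1, 2, 3, 4, 5, 6]]
print(clusters_at(0))    # expected: [[1, 2, 3], [4, 5]]
print(clusters_at(1))    # expected: [[1, 2]]
print(clusters_at(2))    # expected: []

The printed clusters match (2.2); of course, enumerating all partitions is feasible only for toy examples, which is precisely why the polynomial-time structures of Section IV matter.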
The MMI was first proposed as a measure of mutual information in [3] and was later shown in [2] to satisfy (3.2), together with various other properties that naturally extend those of Shannon's mutual information to the multivariate case. The MMI was also later referred to in [16] as "shared information." An important result that inspired the info-clustering paradigm [1] is:

Proposition 3.2 ([2, Theorems 5.2 and 5.3]) The optimal partitions achieving the MMI in (3.7), together with the trivial partition {V}, form a lattice w.r.t. (3.6). The minimum/finest optimal partition, denoted by P*(Z_V), satisfies

    P*(Z_V) \ {{i} | i ∈ V} = C_{I(Z_V)}(Z_V),

where C_γ is defined in (3.1) using the MMI. In particular, γ_1 = I(Z_V) and P_1 = P*(Z_V) in Proposition 3.1. ✷

Together with the laminar structure in Proposition 3.1, it can be argued that any algorithm for computing P*(Z_V) can be applied iteratively for divisive info-clustering as in [1, Algorithm 2]. However, doing so appears to be less efficient compared to the alternative approach in [1, Algorithm 3] that computes all the clusters, but not in any particular order. Nevertheless, in practice it is desirable to have an iterative approach that can stop when further computation is not of interest or is meaningless due to errors in estimating or approximating the entropies from data.

[Fig. 1: Optimal partitions of: (a) I(Z_V) and (b) I(Z_{1,2,3}). In each case, the fundamental partition is the one at the bottom, where the associated clusters are circled with thick lines.]

Example 1 As an illustration of Proposition 3.2 and the divisive approach, consider our motivating example. The finest optimal partition P*(Z_V) = {{1,2,3},{4,5},{6}} is shown in Fig. 1a. For completeness, the figure also shows the lattice of optimal partitions stated in the proposition, where the trivial partition {V} is indicated using a dashed line. Similarly, Fig. 1b shows the optimal partitions of Z_{1,2,3}. Since I(Z_V) = 0 and I(Z_{1,2,3}) = 1, Proposition 3.2 asserts that C_0 = {{1,2,3},{4,5}} and C_1 = {{1,2}}, in agreement with (2.2). Assuming an algorithm for computing the finest optimal partition, the divisive algorithm starts by computing P*(Z_V) = {{1,2,3},{4,5},{6}} (from which I(Z_V) is readily available), declares the non-singleton elements as clusters at threshold I(Z_V), and proceeds iteratively by picking any cluster (of size larger than two) and computing its finest optimal partition, etc. In our example, there is only one such cluster, the set {1,2,3} at threshold 0. The finest optimal partition of Z_{1,2,3} is shown in Fig. 1b, which results in the cluster {1,2} at threshold I(Z_{1,2,3}) = 1. After this, the divisive algorithm terminates in this example. ✷

Algorithm 1: Agglomerative info-clustering.
  Data: Statistics of Z_V sufficient for calculating the entropy function h(B) for B ⊆ V := {1,...,n}.
  Result: The arrays L and PSP contain the γ_ℓ's and P_ℓ's in Proposition 3.1. More precisely, for 1 ≤ ℓ ≤ N, the entries of the arrays PSP and L are PSP[s] = P_ℓ(Z_V) and L[s] = γ_ℓ(Z_V) for s = |P_ℓ(Z_V)|, and are null otherwise.
  1  L, PSP ← empty arrays, each of length n;
  2  PSP[n] ← {{i} : i ∈ V}, s ← n;
  3  while s > 1 do
  4      (γ, P′) ← Fuse(PSP[s]);
  5      L[s] ← γ;
  6      s ← |P′|;
  7      PSP[s] ← P′;
  8  end

Instead of the divisive approach, we consider here an agglomerative approach, shown in Algorithm 1, that computes the γ_ℓ's and P_ℓ's iteratively from ℓ = N down to ℓ = 1. We will give an efficient implementation of the subroutine Fuse in Algorithm 2 that computes γ_ℓ and P_{ℓ−1} from P_ℓ for any ℓ. In particular, we will show that it suffices to compute

    I*(Z_V) := max{I(Z_B) | B ⊆ V, |B| > 1}  and    (3.8)
    C*(Z_V) := maximal{B ⊆ V | |B| > 1, I(Z_B) = I*(Z_V)},    (3.9)

which are clearly the last critical value γ_N and the non-singleton elements of the second last partition P_{N−1}, respectively.

IV. PRELIMINARIES

The ability to compute the info-clustering solution efficiently stems from the submodularity of entropy [17], or equivalently the fact that mutual information is non-negative [18]. More precisely, by denoting the entropy function of Z_V as

    h(B) := H(Z_B)  for B ⊆ V,    (4.1)

(where the dependency on Z_V is implicit for convenience), submodularity of h means that

    h(B_1) + h(B_2) ≥ h(B_1 ∪ B_2) + h(B_1 ∩ B_2)    (4.2)

for all B_1, B_2 ⊆ V. h is also said to be normalized, as h(∅) = 0, and non-decreasing, as h(B′) ≤ h(B) whenever B′ ⊆ B ⊆ V. In combinatorial optimization [19], submodularity is well known to give rise to polynomial-time solutions. The info-clustering problem, in particular, relies on the following closely related polynomial-time solvable structures.
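As a small sanity check (ours, not from the paper), the snippet below verifies numerically that the entropy function of the toy source (2.1) is normalized, non-decreasing, and submodular in the sense of (4.2); the bit-counting entropy h is the same as in the earlier sketch.

# Check that h is normalized, non-decreasing and submodular, cf. (4.1)-(4.2).
from itertools import chain, combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}
V = set(bits)

def h(B):  # entropy function (4.1), in bits
    return len(set().union(*(bits[i] for i in B))) if B else 0

subsets = [frozenset(B) for B in chain.from_iterable(combinations(sorted(V), r)
                                                     for r in range(len(V) + 1))]
assert h(frozenset()) == 0                                                    # normalized
assert all(h(B1) <= h(B2) for B1 in subsets for B2 in subsets if B1 <= B2)    # non-decreasing
assert all(h(B1) + h(B2) >= h(B1 | B2) + h(B1 & B2)                           # submodular (4.2)
           for B1 in subsets for B2 in subsets)
print("h is a normalized, non-decreasing, submodular function on", sorted(V))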
A. Principal sequence of partitions

For any real number γ ∈ R, define the residual entropy function [2] of a random vector Z_V as

    h_γ(B) := h(B) − γ  for B ⊆ V.    (4.3)

The residual entropy function is also submodular, and its Dilworth truncation evaluated at V is defined as [19]

    ĥ_γ(V) := min_{P ∈ Π(V)} h_γ[P],  where    (4.4a)
    h_γ[P] := Σ_{C∈P} h_γ(C).    (4.4b)

Proposition 4.1 ([8]) Submodularity (4.2) of h (4.1) implies that the set of optimal partitions to the Dilworth truncation (4.4a) forms a lattice, called the Dilworth truncation lattice, with respect to the partial order (3.6). Furthermore, if P′ and P′′ are the optimal partitions for γ′ and γ′′ respectively, then γ′ < γ′′ implies P′ ⪰ P′′. ✷

In particular, the minimum/finest optimal partition exists and characterizes the info-clustering solution as follows:

Proposition 4.2 ([1, Corollary 2]) For a finite set V with size |V| > 1 and a random vector Z_V,

    C_γ(Z_V) = min{P ∈ Π(V) | h_γ[P] = ĥ_γ(V)} \ {{i} | i ∈ V},    (4.5)

namely, the non-singleton elements of the finest optimal partition to the Dilworth truncation (4.4a). ✷

Hence, in Proposition 3.1, the sequence of P_ℓ for ℓ from 1 to N with the corresponding γ_ℓ also characterizes the minimum optimal partitions to (4.4a) for all γ ∈ R, and is known as the principal sequence of partitions (PSP) of h, introduced in [8].

Example 2 For the motivating example, the Dilworth truncation ĥ_γ(V) (4.4a) is shown in Fig. 2a. The Dilworth truncation is piecewise linear with at most |V| turning points. A turning point p_i := (γ_i, ĥ_{γ_i}(V)) occurs when (4.4a) has more than one solution. The collection of such optimal solutions is the Dilworth truncation lattice at γ_i indicated in Proposition 4.1. (The Dilworth truncation lattice is not shown in Fig. 2.) The finest optimal partition at γ_i remains optimal (while all other partitions cease to be) until the next turning point. In other words, the finest optimal partition at γ_i determines the line segment that follows γ_i. The sequence of the finest optimal partitions is the PSP, which is shown in Fig. 2b. ✷

[Fig. 2: Dilworth truncation. (a) ĥ_γ(V), the lower envelope of the lines h_γ[{{1,...,6}}] = 4 − γ, h_γ[{{1,2,3},{4,5},{6}}] = 4 − 3γ, h_γ[{{1,2},{3},{4},{5},{6}}] = 6 − 5γ and h_γ[{{1},...,{6}}] = 8 − 6γ, with turning points at γ = 0 = I(Z_{1,...,6}), γ = 1 = I(Z_{1,2,3}) = I(Z_{4,5}) and γ = 2 = I(Z_{1,2}). (b) The corresponding PSP.]

B. Principal sequence and minimum norm base

Consider any submodular function f : 2^U → R (4.2) on the finite ground set U. For λ ∈ R, define

    S_λ(f) := max argmin_{B ⊆ U} f(B) − λ|B|,    (4.6)

where max is the inclusion-wise maximum. Note that the minimization in (4.6) is a submodular function minimization (SFM) since the function B ⊆ U ↦ f(B) − λ|B| is also submodular. It is well known that the minimizers form a lattice with respect to set inclusion [19], and so the maximum in (4.6) exists and is unique.

Proposition 4.3 ([12, 14]) S_λ(f) for λ ∈ R satisfies

    S_{λ′}(f) ⊆ S_{λ′′}(f)  iff  λ′ ≤ λ′′,

and is referred to as the principal sequence (PS). ✷
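To make Proposition 4.3 concrete, here is a small exhaustive sketch (ours) that computes S_λ(f) of (4.6) for the entropy function of the toy source and a few values of λ, and checks that the resulting sets are nested; exhaustive enumeration stands in for a proper SFM solver, and the helper name S is illustrative only.

# Brute-force S_lambda(f) of (4.6) for f = h on the toy source, and the nesting of
# Proposition 4.3. The chosen lambdas are exact binary fractions, so the float
# comparison against the minimum value is exact here.
from itertools import chain, combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}
U = set(bits)

def h(B):
    return len(set().union(*(bits[i] for i in B))) if B else 0

def all_subsets():
    return chain.from_iterable(combinations(sorted(U), r) for r in range(len(U) + 1))

def S(lam, f=h):
    """Inclusion-wise maximum minimizer of f(B) - lam*|B|, cf. (4.6)."""
    best = min(f(B) - lam * len(B) for B in all_subsets())
    minimizers = [frozenset(B) for B in all_subsets() if f(B) - lam * len(B) == best]
    return max(minimizers, key=len)   # unique by the lattice structure of minimizers

chain_of_sets = [S(lam) for lam in (0.25, 0.5, 1.0, 2.0)]
print([sorted(B) for B in chain_of_sets])   # e.g. [[], [4, 5], [1,...,6], [1,...,6]]
assert all(a <= b for a, b in zip(chain_of_sets, chain_of_sets[1:]))   # PS nesting

For this example the chain is ∅ ⊆ {4,5} ⊆ {1,...,6}, i.e., the sets grow as λ increases, as Proposition 4.3 asserts.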
Without loss of generality, we assume f is normalized, i.e., f(∅) = 0, because we can redefine f as f − f(∅) without affecting the solutions to the SFM in (4.6), i.e., the PS is invariant to constant shifts of the submodular function. The polyhedron P(f) and base polyhedron B(f) of the submodular function f are defined as

    P(f) := {x_U ∈ R^U | x(B) ≤ f(B), ∀ B ⊆ U}    (4.7)
    B(f) := {x_U ∈ P(f) | x(U) = f(U)},    (4.8)

where x_U := (x_i | i ∈ U) and x(B) := Σ_{i∈B} x_i for convenience. (P(f) and B(f) are non-empty as f(∅) ≥ 0.) With ‖x_U‖ denoting the Euclidean norm of the vector x_U, the following holds.

Proposition 4.4 ([12, 20]) For any normalized submodular function f (4.2),

    min{‖x_U‖ | x_U ∈ B(f)}    (4.9)

has a unique solution x*_U, called the minimum (Euclidean) norm base, which satisfies

    x*_i(f) = min{λ ∈ R | i ∈ S_λ(f)}  ∀ i ∈ U,  or equivalently,    (4.10a)
    S_λ(f) = {i ∈ U | x*_i(f) ≤ λ},  ∀ λ ∈ R,    (4.10b)

where the equivalence follows directly from Proposition 4.3. ✷

The minimum norm base may be computed using Wolfe's minimum norm point algorithm as in [20], and so may the PS by (4.10b). Conversely, by (4.10a), the minimum norm base can also be computed by any SFM algorithm that solves S_λ for any λ, but the minimum norm point algorithm was shown [20] empirically to perform well compared to other submodular function minimization algorithms.

V. MAIN RESULTS

Algorithm 2: Implementation of Fuse in Algorithm 1.
  Data: P is equal to P_ℓ for some 1 ≤ ℓ ≤ N in Proposition 3.1.
  Result: (γ, P′) is equal to (γ_ℓ, P_{ℓ−1}).
  1  Enumerate P as {C_1,...,C_k} for some k > 1 and disjoint C_i's;
  2  x ← empty array of size k;
  3  for j = 1 to k do
  4      x[j] ← MinNormBase(B ↦ h(⋃_{i∈B∪{j}} C_i) − Σ_{i∈B∪{j}} h(C_i), {j+1,...,k});
  5  end
  6  γ ← − min{x[j][i] | 1 ≤ j < i ≤ k},  P′ ← ∅;
  7  for j = 1 to k do
  8      if C_j ∉ ⋃P′ then
  9          add {C_j} ∪ {C_i | i ∈ {j+1,...,k}, x[j][i] ≤ −γ} to P′;
  10     end
  11 end
  12 function MinNormBase(f, U):
  13     return an array x (indexed by U) that solves (4.9).
  14 end

The main result is the implementation of Fuse in Algorithm 2, which computes the PSP, and therefore the info-clustering solution, iteratively from the finer partitions to the coarser ones. This is done by computing the PS using a subroutine MinNormBase that computes the minimum norm base solving (4.9). An explicit implementation of this subroutine can be found in [20] using Wolfe's minimum norm point algorithm. The current abstraction also allows further approximations or simplifications for special source models as in [1].

To explain Algorithm 2, we first simplify what Fuse should compute as follows.

Theorem 5.1 Consider P_ℓ and γ_ℓ defined as in Proposition 3.1 for 1 ≤ ℓ ≤ N and write

    P_ℓ = {C_j | j ∈ U}  and  Z′_j := Z_{C_j}

for some index set U and disjoint subsets C_j for j ∈ U. Then, for 1 ≤ ℓ ≤ N we have

    γ_ℓ = max{I(Z_{∪F}) | F ⊆ P_ℓ, |F| > 1}    (5.1a)
        = I*(Z′_U),    (5.1b)

    P_{ℓ−1} \ P_ℓ = maximal{∪F | F ∈ argmax_{F ⊆ P_ℓ: |F|>1} I(Z_{∪F})}    (5.2a)
                  = {⋃_{j∈B} C_j | B ∈ C*(Z′_U)},    (5.2b)

where ∪F := ⋃_{B∈F} B for convenience. More precisely, (5.1a) and (5.2a) follow more generally from the property (3.2), while (5.1b) and (5.2b) follow from the definition (3.7) of the MMI. ✷

PROOF See Appendix A. ∎

It follows from (5.1b) and (5.2b) that it suffices to compute I* (3.8) and C* (3.9). This can be done using a minimum norm base algorithm as follows. Define

    J_T(Z_V) := I_{{{i}|i∈V}}(Z_V) = 1/(|V|−1) [ Σ_{i∈V} H(Z_i) − H(Z_V) ],    (5.3)

which is called the normalized total correlation [2], as it is the same as Watanabe's total correlation [21] except for the additional normalization factor of 1/(|V|−1).

Theorem 5.2 With the entropy function h (4.1) for Z_V, define

    U_j := {i ∈ V | i > j}  for j ∈ V, assuming V is ordered, and
    g_j(B) := h(B ∪ {j}) − Σ_{i∈B∪{j}} h({i})  for B ⊆ U_j,

which is a normalized submodular function. Then,

    I*(Z_V) = max_{C⊆V: |C|>1} J_T(Z_C)    (5.4a)
            = − min_{j∈V} min_{i∈U_j} x^{(j)}_i    (5.4b)

    C*(Z_V) = maximal argmax_{C⊆V: |C|>1} J_T(Z_C)    (5.5a)
            = maximal{ {j*} ∪ argmin_{i∈U_{j*}} x^{(j*)}_i | j* ∈ argmin_{j∈V} ( min_{i∈U_j} x^{(j)}_i ) },    (5.5b)

where x^{(j)}_{U_j} is the minimum norm base for g_j (4.9). ✷

PROOF See Appendix A. ∎
(5.4a) and (5.5a) essentially eliminate the need for minimization over partitions in calculating the MMI in (3.8) and (3.9). They serve as an intermediate step that leads to (5.4b) and (5.5b), which relate the last critical value (3.8) and the second last partition (3.9) to the minimum norm base. Together with Proposition 4.4, we have the complete implementation of Fuse as shown in Algorithm 2 using a minimum norm base algorithm.

Example 3 As an illustration of Theorem 5.2 (and the agglomerative algorithm), consider our running example with V = {1,...,6}. The minimum norm base x^{(j)}_{U_j} of g_j is given as

    x^{(1)}_{U_1} = (−2, −1, −0.5, −0.5, 0)    (entries indexed by i = 2,...,6)
    x^{(2)}_{U_2} = (−1, −0.5, −0.5, 0)        (i = 3,...,6)
    x^{(3)}_{U_3} = (−0.5, −0.5, 0)            (i = 4,...,6)
    x^{(4)}_{U_4} = (−1, 0)                    (i = 5,6)
    x^{(5)}_{U_5} = (0)                        (i = 6).

By (5.4b), we have I*(Z_V) = 2, and by (5.5b), we have C*(Z_V) = {{1,2}}. In other words, starting with the partition into singletons in the agglomerative algorithm, the function Fuse returns the threshold value 2 and the partition {{1,2},{3},{4},{5},{6}}. Let Z′_1 = Z_{1,2} and Z′_i = Z_{i+1} for i = 2,...,5. The minimum norm base x^{(j)}_{U_j} of g_j (defined using the entropy function of Z′_{1,...,5}) is given as

    x^{(1)}_{U_1} = (−1, −0.5, −0.5, 0)    (i = 2,...,5)
    x^{(2)}_{U_2} = (−0.5, −0.5, 0)        (i = 3,...,5)
    x^{(3)}_{U_3} = (−1, 0)                (i = 4,5)
    x^{(4)}_{U_4} = (0)                    (i = 5).

By (5.4b), we have I*(Z′_{1,...,5}) = 1, and by (5.5b), we have C*(Z′_{1,...,5}) = {{1,2},{3,4}}, which in terms of Z_V results in the clusters {{1,2,3},{4,5}}. In other words, when called with the input partition {{1,2},{3},{4},{5},{6}}, the function Fuse returns the threshold value 1 and the partition {{1,2,3},{4,5},{6}}. Let Z″_1 = Z_{1,2,3}, Z″_2 = Z_{4,5}, and Z″_3 = Z_6. The minimum norm base x^{(j)}_{U_j} of g_j (defined using the entropy function of Z″_{1,2,3}) is given as

    x^{(1)}_{U_1} = (0, 0)    (i = 2,3)
    x^{(2)}_{U_2} = (0)       (i = 3).

By (5.4b), we have I*(Z″_{1,2,3}) = 0, and by (5.5b), we have C*(Z″_{1,2,3}) = {{1,2,3}}, which in terms of Z_V results in the cluster {1,...,6}. In other words, when called with the input partition {{1,2,3},{4,5},{6}}, the function Fuse returns the threshold value 0 and the trivial partition {{1,...,6}}, at which point the agglomerative algorithm terminates. ✷

The complexity of the algorithm is mainly due to the computation of the minimum norm base in line 4. This computation is repeated at most |V| times. With MNP(l) being the complexity of the minimum norm base algorithm for a ground set of size l, Fuse runs in time O(|V| MNP(|V|)). Since the agglomerative info-clustering algorithm in Algorithm 1 invokes the function Fuse N − 1 ≤ |V| − 1 times, it runs in time O(|V|^2 MNP(|V|)), which is equivalent to that of [1, Algorithm 3], assuming that the submodular function minimization therein is implemented by the minimum norm base algorithm, i.e., with SFM = MNP. The divisive info-clustering algorithm in [1, Algorithm 2] makes N − 1 calls to a subroutine that calculates the fundamental partition. However, computing the fundamental partition appears to take time O(|V|^2 MNP(|V|)), which would lead to an overall complexity of O(|V|^3 MNP(|V|)) for the divisive clustering. Hence, the agglomerative info-clustering appears more efficient.
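To tie Algorithm 1, Theorem 5.1 and Example 3 together, here is an exhaustive-search stand-in (ours) for the Fuse subroutine: it computes γ_ℓ and P_{ℓ−1} directly from (5.1a)–(5.2a) using the brute-force MMI of the earlier sketch, so it is exponential-time and only serves to reproduce the toy run; an efficient implementation would instead use the minimum norm bases as in Algorithm 2.

# Brute-force reference for the contract of Fuse in Algorithm 2, via (5.1a)-(5.2a):
# gamma_l is the largest MMI over fused groups of current blocks, and the new blocks
# are the maximal optimal fusions. Running Algorithm 1's loop with it reproduces Example 3.
from itertools import combinations

bits = {1: {'a', 'd'}, 2: {'a', 'd'}, 3: {'a'}, 4: {'b'}, 5: {'b'}, 6: {'c'}}

def H(B):
    return len(set().union(*(bits[i] for i in B))) if B else 0

def partitions(items):
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        for k, block in enumerate(smaller):
            yield smaller[:k] + [[first] + block] + smaller[k + 1:]
        yield [[first]] + smaller

def mmi(B):
    B = sorted(B)
    return min((sum(H(C) for C in P) - H(B)) / (len(P) - 1)
               for P in partitions(B) if len(P) > 1)

def fuse(P):
    """Return (gamma, P') with gamma the best MMI over fused groups of blocks of P,
    and P' obtained by merging each maximal optimal group into one block."""
    P = [frozenset(C) for C in P]
    groups = [G for r in range(2, len(P) + 1) for G in combinations(P, r)]
    gamma = max(mmi(frozenset().union(*G)) for G in groups)
    best = [frozenset().union(*G) for G in groups
            if mmi(frozenset().union(*G)) == gamma]
    merged = [B for B in best if not any(B < D for D in best)]
    untouched = [C for C in P if not any(C <= B for B in merged)]
    return gamma, merged + untouched

P = [{i} for i in bits]                    # start from singletons, as in Algorithm 1
while len(P) > 1:
    gamma, P = fuse(P)
    print(gamma, [sorted(C) for C in P])
# expected printout:
# 2.0 [[1, 2], [3], [4], [5], [6]]
# 1.0 [[1, 2, 3], [4, 5], [6]]
# 0.0 [[1, 2, 3, 4, 5, 6]]

Its printout matches the three Fuse calls of Example 3 and the PSP of Fig. 2b.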
VI. CONCLUSION

To address the concerns of entropy estimation and computational complexity, we have proposed an agglomerative info-clustering approach that merges smaller clusters into larger clusters successively. To the best of our knowledge, this is the fastest info-clustering algorithm without any approximation. As mentioned in [1], however, faster algorithms are possible under special source models, such as the Markov tree model or the Chow–Liu tree approximation [22]. For graphical source models, the PSP can be computed more efficiently using a parametric maxflow algorithm as in [23]. The info-clustering algorithm can also be used to compute the solution of some related problems, such as the optimal discussion rate tuple for successive omniscience [24].

APPENDIX A
PROOFS OF MAIN RESULTS

PROOF (THEOREM 5.1) First, we prove (5.1a) and (5.2a) using the general property (3.2) instead of the precise definition (3.7).

We first argue that, for any feasible solution F to the r.h.s. of (5.1a),

    I(Z_{∪F}) ≤ γ_ℓ.    (a)

Suppose to the contrary that I(Z_{∪F}) > γ_ℓ. Then, there exists C ⊇ ∪F such that C ∈ C_{γ_ℓ}(Z_V) by the definition (3.1) of clusters. However, C_{γ_ℓ}(Z_V) ⊆ P_ℓ by Proposition 3.1, which contradicts the fact that F ⊆ P_ℓ with |F| > 1.

Next, we show that (a) can be achieved with equality for some feasible solution F. Consider any C ∈ P_{ℓ−1} \ P_ℓ. (Such a C exists since P_ℓ is strictly finer than P_{ℓ−1}.) Then, we have

    C = ∪F  for some F ⊆ P_ℓ with |F| > 1,    (b)

i.e., for some feasible solution F. By Proposition 3.1, we have

    C ∈ C_γ(Z_V)  for all γ ∈ [γ_{ℓ−1}, γ_ℓ),

and so I(Z_C) ≥ γ_ℓ, i.e., larger than all values in the interval. The reverse inequality also holds by (a) and (b). Hence, we have

    I(Z_C) = γ_ℓ,    (c)

which implies (5.1a) as desired.

Now, we argue that the above construction gives all the optimal solutions to the r.h.s. of (5.1a), hence establishing (5.2a). For any C ∈ P_{ℓ−1} \ P_ℓ, (b) and (c) imply that "⊆" holds for (5.2a), because the fact that C ∈ C_{γ_{ℓ−1}}(Z_V) (by Proposition 3.1) means that it is maximal by the definition (3.1) of clusters.

To argue the reverse inclusion "⊇", consider any F belonging to the r.h.s. of (5.2a). By (5.1a),

    I(Z_{∪F}) = γ_ℓ > γ_{ℓ−1},

and so ∪F ∈ C_{γ_{ℓ−1}} by the definition (3.1) of clusters and the maximality of ∪F. This completes the proof of (5.2a).

Consider proving (5.1b) and (5.2b). For any optimal solution F to (5.1a), we have

    P*(Z_{∪F}) = F = {C_j | j ∈ B}

for some B ⊆ U, where the first equality is by Proposition 3.2 since I(Z_C) > γ_ℓ for all C ∈ F such that |C| > 1. Hence,

    I(Z_{∪F}) = I_{P*(Z_{∪F})}(Z_{∪F}) = I_F(Z_{∪F})
              = I_{{{j}|j∈B}}(Z′_B) ≥ I(Z′_B),

where the inequality follows from the fact that the partition {{j} | j ∈ B} into singletons may not be the optimal partition of B for I(Z′_B). The reverse inequality also holds because, for all P′ ∈ Π′(B), defining

    P := {⋃_{j∈C′} C_j | C′ ∈ P′} ∈ Π′(∪F),

we have I_{P′}(Z′_B) = I_P(Z_{∪F}) ≥ I(Z_{∪F}). Here, the inequality follows from the fact that P may not be the optimal partition of Z_{∪F}. This completes the proof of Theorem 5.1. ∎

PROOF (THEOREM 5.2) Consider γ_N, P_{N−1}, and P_N as in Proposition 3.1 and let h_γ be as in (4.3). By Proposition 4.2, we have for all C ∈ P_{N−1} \ P_N that

    h_{γ_N}(C) = Σ_{i∈C} h_{γ_N}({i}),    (a)

or equivalently,

    γ_N = J_T(Z_C).    (b)

• The equivalence between (a) and (b) follows immediately from the definition (5.3) of J_T. More precisely, (b) is equivalent to

    γ_N = 1/(|C|−1) [ Σ_{i∈C} h({i}) − h(C) ]    by (4.1)
    ⟺ (|C|−1) γ_N = Σ_{i∈C} h({i}) − h(C)       since |C| > 1
    ⟺ h_{γ_N}(C) = Σ_{i∈C} h_{γ_N}({i})          by (4.3),

which is equivalent to (a) as desired.
• "≥" for (a) follows from

    h_{γ_N}[{C} ∪ {{i} | i ∈ V \ C}] ≥ h_{γ_N}[P_N],

since P_N is optimal to ĥ_{γ_N}(V) (4.4) by construction.
• To explain "≤" for (a), note that P_{N−1} is an optimal partition to the Dilworth truncation (4.4a) for γ ∈ [γ_{N−1}, γ_N) by Proposition 4.2. By continuity of h_γ[P_{N−1}] (4.4b) with respect to γ, we have that P_{N−1} is also optimal for γ = γ_N, i.e.,

    h_{γ_N}[P_N] = h_{γ_N}[P_{N−1}],    (c)

by the optimality of P_N. Hence, since

    Σ_{{i}∈P_N} h_{γ_N}({i}) = Σ_{C∈P_{N−1}} Σ_{i∈C} h_{γ_N}({i}),

we have

    Σ_{C∈P_{N−1}\P_N} [ h_{γ_N}(C) − Σ_{i∈C} h_{γ_N}({i}) ] = 0.

Each term in the bracket is non-negative as argued before, and so must be equal to zero.

Consider any maximal solution C′ to the r.h.s. of (5.4a). Then,

    γ_N ≤ J_T(Z_{C′}),    (d)

or equivalently,

    h_{γ_N}(C′) ≤ Σ_{i∈C′} h_{γ_N}({i}),    (e)

which follow from (b) and (a), respectively, because (5.4a) does not require C ∈ P_{N−1} \ P_N. With

    P′ := {C′} ∪ {{i} | i ∈ V \ C′},

(e) implies

    h_{γ_N}[P′] ≤ h_{γ_N}[P_N],

which must hold with equality by the optimality of P_N. Therefore, (d) also holds with equality, and so we have (5.4a) since γ_N = I*(Z_V). Note that, by (c), P_{N−1} is also an optimal partition. Furthermore, it is the largest/coarsest such partition since it is optimal for some γ < γ_N and therefore larger/coarser than all optimal partitions for γ = γ_N by Proposition 4.1. Therefore, the set of maximal optimal solutions to the r.h.s. of (5.4a) is P_{N−1} \ P_N, which is C*(Z_V) trivially, and so we have (5.5a).

To prove (5.4b) and (5.5b), let γ* = I*(Z_V). Then, it follows from (5.4a) that

    0 = min_{C⊆V: |C|>1} γ* − J_T(Z_C)
      = min_{C⊆V: |C|>1} (|C|−1)γ* + h(C) − Σ_{i∈C} h({i})    (f)
      = min_{j∈V} min_{B⊆U_j: |B|≥1} g_j(B) + γ*|B|.    (g)

• (f) is obtained by the definition (5.3) of J_T and multiplying both sides of the equality by |C| − 1 > 0, which neither violates the equality nor changes the set of solutions;
• (g) is obtained by changing the variable C to (j, B) using the bijection that sets

    j := min_{i∈C} i  and  B := C \ {j},

which is possible because |C| > 1. We have also applied |B| = |C| − 1 and the definition of g_j to rewrite the expression in the minimization.

Consider any optimal solution j* to the r.h.s. of (g). Then, we have

    S_λ(g_{j*}) ≠ ∅  for λ = −γ*,    (h)
    S_λ(g_{j*}) = ∅  for λ < −γ*.    (i)

• (h) is because, by (g),

    ∅ ≠ max argmin_{B⊆U_{j*}: |B|≥1} g_{j*}(B) + γ*|B|
      = max argmin_{B⊆U_{j*}} g_{j*}(B) + γ*|B|
      = S_{−γ*}(g_{j*}).

The last equality is by the definition (4.6) of S_{−γ*}. The first equality is because allowing B = ∅ does not change the minimum value 0, since

    g_{j*}(∅) + γ*|∅| = 0,

as g_{j*}(∅) = 0. Doing so also does not affect the maximum minimizer, since the newly introduced optimal solution, namely ∅, trivially cannot be the maximum.
• To explain (i), note that (g) implies, for all B ⊆ U_{j*} with |B| ≥ 1, that

    g_{j*}(B) + γ*|B| ≥ 0,  and so
    g_{j*}(B) − λ|B| ≥ (−λ − γ*)|B|  ∀ λ ∈ R
                     > 0              ∀ λ < −γ*.

In other words, for λ < −γ*, the empty set is the unique solution to min_{B⊆U_{j*}} g_{j*}(B) − λ|B|, which implies (i) by the definition (4.6) of S_λ.

Now, (h) and (i) imply that

    −γ* = sup{λ ∈ R | S_λ(g_{j*}) = ∅}
        = min_{i∈U_{j*}} min{λ ∈ R | i ∈ S_λ(g_{j*})}
        = min_{i∈U_{j*}} x^{(j*)}_i,    (j)

where the last equality (j) is by (4.10a), since x^{(j*)}_{U_{j*}} denotes the minimum norm base for g_{j*}. The equality implies (5.4b) as desired.

To prove (5.5b), consider any set C* that belongs to the r.h.s. of (5.5a). Applying the bijection

    j* := min_{i∈C*} i  and  B* := C* \ {j*},

(j*, B*) is a solution to the r.h.s. of (g). By the inclusion-wise maximality of C*, the set B* is also a maximal (the maximum) solution, i.e.,

    B* = S_{−γ*}(g_{j*})    (k)
       = {i ∈ U_{j*} | i ∈ S_{−γ*}(g_{j*})}
       = {i ∈ U_{j*} | x^{(j*)}_i ≤ −γ*}    (l)
       = argmin_{i∈U_{j*}} x^{(j*)}_i,    (m)

where (k) is by (4.6); (l) is by (4.10b); and (m) is by (j). Hence, C* also belongs to the r.h.s. of (5.5b).

Conversely, if (j*, B*) is an optimal solution to the r.h.s. of (g), it can be argued easily that C* := {j*} ∪ B* is also a solution to the r.h.s. of (5.4a). The maximal such C* therefore belongs to the l.h.s. of (5.5b) as desired. This completes the proof. ∎

REFERENCES

[1] C. Chan, A. Al-Bashabsheh, Q. Zhou, T. Kaced, and T. Liu, "Info-clustering: A mathematical theory for data clustering," IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 2, no. 1, pp. 64–91, June 2016.
[2] C. Chan, A. Al-Bashabsheh, J. Ebrahimi, T. Kaced, and T. Liu, "Multivariate mutual information inspired by secret-key agreement," Proceedings of the IEEE, vol. 103, no. 10, pp. 1883–1913, Oct. 2015.
[3] C. Chan, "On tightness of mutual dependence upper bound for secret-key capacity of multiple terminals," arXiv preprint arXiv:0805.3200, 2008.
[4] I. Csiszár and P. Narayan, "Secrecy capacities for multiple terminals," IEEE Trans. Inf. Theory, vol. 50, no. 12, Dec. 2004.
[5] C. Chan and L. Zheng, "Mutual dependence for secret key agreement," in Proceedings of 44th Annual Conference on Information Sciences and Systems, 2010.
[6] C. Chan, "The hidden flow of information," in Proc. IEEE Int. Symp. on Inf. Theory, St. Petersburg, Russia, Jul. 2011.
[7] ——, "Matroidal undirected network," in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, July 2012, pp. 1498–1502.
[8] H. Narayanan, "The principal lattice of partitions of a submodular function," Linear Algebra and its Applications, vol. 144, no. 0, pp. 179–216, 1990.
[9] R. W. Yeung, Information Theory and Network Coding. Springer, 2008.
[10] K. Nagano, Y. Kawahara, and S. Iwata, "Minimum average cost clustering," in NIPS, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Curran Associates, Inc., 2010, pp. 1759–1767.
[11] Y. Wu and P. Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3702–3720, 2016.
[12] S. Fujishige, "Lexicographically optimal base of a polymatroid with respect to a weight vector," Mathematics of Operations Research, vol. 5, no. 2, pp. 186–196, 1980.
[13] ——, Submodular Functions and Optimization, 2nd ed. Elsevier, 2005.
[14] ——, "Theory of principal partitions revisited," in Research Trends in Combinatorial Optimization. Springer, 2009, pp. 127–162.
[15] C. Chan, A. Al-Bashabsheh, Q. Zhou, and T. Liu, "Duality between feature selection and data clustering," in 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton Retreat Center, Monticello, Illinois, 2016.
[16] P. Narayan and H. Tyagi, "Multiterminal secrecy by public discussion," Foundations and Trends in Communications and Information Theory, vol. 13, no. 2–3, pp. 129–275, 2016.
[17] S. Fujishige, "Polymatroidal dependence structure of a set of random variables," Information and Control, vol. 39, no. 1, pp. 55–72, 1978.
[18] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July 1948.
[19] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2002.
[20] S. Fujishige and S. Isotani, "A submodular function minimization algorithm based on the minimum-norm base," Pacific Journal of Optimization, vol. 7, no. 1, pp. 3–17, 2011.
[21] S. Watanabe, "Information theoretical analysis of multivariate correlation," IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.
[22] C. Chan and T. Liu, "Clustering of random variables by multivariate mutual information on Chow-Liu tree approximations," in Fifty-Third Annual Allerton Conference on Communication, Control, and Computing, Allerton Retreat Center, Monticello, Illinois, Sep. 2015.
[23] V. Kolmogorov, "A faster algorithm for computing the principal sequence of partitions of a graph," Algorithmica, vol. 56, no. 4, pp. 394–412, 2010.
[24] C. Chan, A. Al-Bashabsheh, Q. Zhou, N. Ding, T. Liu, and A. Sprintson, "Successive omniscience," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3270–3289, June 2016.