Scalable Nearest Neighbor Search
based on kNN Graph
Wan-Lei Zhao*, Jie Yang, Cheng-Hao Deng
• Fujian Key Laboratory of Sensing and Computing for Smart City, and the School of Information Science and Engineering, Xiamen University, Xiamen, 361005, P.R. China. Wan-Lei Zhao is the corresponding author. E-mail: [email protected].

Abstract—Nearest neighbor search is known as a challenging issue that has been studied for several decades. Recently, this issue has become more and more pressing as big data problems arise in various fields. In this paper, a scalable solution based on a hill-climbing strategy with the support of the k-nearest neighbor graph (kNN) is presented. Two major issues are considered in the paper. Firstly, an efficient kNN graph construction method based on the two means tree is presented. For the nearest neighbor search, an enhanced hill-climbing procedure is proposed, which achieves a considerable performance boost over the original procedure. Furthermore, with the support of inverted indexing derived from residue vector quantization, our method achieves close to 100% recall with high speed efficiency on two state-of-the-art evaluation benchmarks. In addition, a comparative study of both the compressional and the traditional nearest neighbor search methods is presented. We show that our method achieves the best trade-off between search quality, efficiency and memory complexity.

Index Terms—nearest neighbor search, k-nn graph, hill-climbing.
1 INTRODUCTION

Nearest neighbor search (NNS) generally arises from a wide range of subjects, such as databases, machine learning, computer vision, and information retrieval. Due to the fundamental role it plays, it has been studied in many computer fields for several decades. The nearest neighbor search problem can be simply defined as follows. Given a query vector (q ∈ R^D) and n candidates of the same dimensionality, it is required to return the sample(s) that are spatially closest to the query according to a certain metric (usually the l1-distance or l2-distance).
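For concreteness, the exhaustive baseline that the methods discussed below approximate can be written in a few lines. This is an illustrative sketch only; the use of NumPy and the function and variable names are ours, not part of any evaluated implementation.

```python
import numpy as np

def brute_force_nns(query, candidates, k=1):
    """Exhaustive nearest neighbor search under the l2-distance.

    query:      (D,) vector
    candidates: (n, D) matrix holding the reference set
    Returns the indices of the k closest candidates.
    """
    # Squared l2-distances to every candidate; O(n * D) per query.
    dists = np.sum((candidates - query) ** 2, axis=1)
    return np.argsort(dists)[:k]

# Example: 10,000 random 128-d candidates (SIFT-like), one random query.
X = np.random.rand(10000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(brute_force_nns(q, X, k=5))
```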
Traditionally, this issue has been addressed by various space partitioning strategies. However, these methods are hardly scalable to high-dimensional (e.g., D > 20), large-scale and dense vector spaces. In such cases, most of the traditional methods such as the k-d tree [1], R-tree [2] and locality sensitive hashing (LSH) [3] are unable to return decent results.

Recently, there are two major trends in the literature. In one direction, the nearest neighbor search is conducted based on a k-nearest neighbor graph (kNN graph) [4], [5], [6], in which the kNN graph is constructed offline. Alternatively, NNS is addressed based on vector quantization [7], [8], [9]. The primary goal of the latter is to compress the reference set by vector quantization, such that it is possible to load the whole reference set (after compression) into memory even when the reference set is extremely large.

There are two key issues to be addressed in kNN graph based methods. The primary issue is the efficient construction of the kNN graph, given that the original problem is at the complexity of O(D·n^2). Another issue is how to perform the NN search once the kNN graph is supplied. The first issue has been addressed by the scheme of "divide-and-conquer" [10], [11] and by nearest neighbor descent [12]. The combination of these two schemes is also seen in recent work [6]. Thanks to the above schemes, the complexity of constructing the kNN graph has been reduced to around O(D·n·log(n)). With the support of the kNN graph, nearest neighbor search is conducted by a hill-climbing strategy [4]. However, this strategy could be easily trapped in local optima since the seed samples are randomly selected. Recently, this issue has been addressed by search perturbation [5], in which an intervention scheme is introduced when the search procedure is trapped in a local optimum. As indicated in recent work [6], this solution is sub-optimal. Alternatively, with the support of k-d trees that index the whole reference set, the hill-climbing search is directed to start from the neighborhood of potential nearest neighbors [6]. However, in order to achieve high performance, multiple k-d trees are required, which causes considerable memory overhead.

Considering the big memory cost of the traditional methods, NNS has also been addressed based on vector quantization. The representative methods are the product quantizer (PQ) [7], additive quantizer (AQ) [8], composite quantizer (CQ) [9] and stacked quantizer (SQ) [13]. Essentially, in all these methods, the vectors in the reference set are compressed via vector quantization. The advantages that the quantization methods bring are twofold. On one hand, the candidate vectors have been compressed (typically the memory consumption is one order of magnitude lower than the size of the reference set), which makes it possible to scale up the nearest neighbor search to a very large range. On the other hand, the distance computation between the query and all the candidates becomes very efficient when it is approximated by the distances between the query and the vocabulary words. However, due to the approximation in both vector representation and distance calculation, high recall is hard to achieve
with the compressional methods.

In our paper, similar to [4], [5], [6], NNS is basically addressed via a hill-climbing strategy based on the kNN graph. Two major issues are considered. Firstly, an efficient and scalable kNN graph construction method is presented. In addition, different from existing strategies, the hill-climbing search is directed by multi-layer residue vector quantization (RVQ) [14]. The residue quantization forms a coarse-to-fine partition over the whole vector space. For this reason, it is able to direct the hill-climbing search to start, most likely, from within the neighborhood of the true nearest neighbor. Since the inverted indexing derived from residue quantization only keeps the IDs of the reference vectors, negligible extra memory is required in comparison to the method in [6]. As a result, the major memory cost of our method is to hold the reference set. Furthermore, with the support of the multi-layer quantization, sparse indexing over the big reference set is achievable with a series of small vocabularies, which makes the search procedure very efficient and easy to scale up to billion-level datasets.

The remainder of this paper is organized as follows. In Section 2, a brief review of the literature on NNS is presented. Section 3 presents an efficient algorithm for kNN graph construction. Based on the constructed kNN graph, an enhanced hill-climbing procedure for nearest neighbor search is then presented. Section 4 shows how the inverted indexing structure is built based on residue vector quantization, which boosts the performance of the hill-climbing procedure. The experimental study of the enhanced hill-climbing nearest neighbor search is presented in Section 5.
2 RELATED WORKS

The early study of the NNS issue can be traced back to the 1970s, when the need for NNS over file systems arose. In those days, the data to be processed were of very low dimension, typically 1D. This problem is well addressed by the B-tree [15] and its variant the B+-tree [15], based on which the NNS complexity could be as low as O(log(n)). However, the B-tree is not naturally extensible beyond the 1D case. More sophisticated indexing structures were designed to handle NNS on multi-dimensional data. Representative structures are the k-d tree [1], R-tree [2] and R*-tree [16]. For the k-d tree, a pivot vector is selected each time to split the dataset evenly into two. By applying this bisecting repeatedly, the hyper-space is partitioned into embedded hierarchical sub-spaces. The NNS is performed by traversing over one or several branches to probe the nearest neighbor. The space partitioning aims to restrict the search to a minimum number of sub-spaces. However, unlike the B-tree in the 1D case, the partition scheme does not exclude the possibility that the nearest neighbor resides outside of these candidate sub-spaces. Therefore, extensive probing over the large number of branches in the k-d tree becomes inevitable. For this reason, NNS with a k-d tree could be very slow. Similar comments apply to the R-tree and its variants, despite the more sophisticated partition strategies proposed in these tree structures. The recent indexing structure FLANN [17] partitions the space with hierarchical k-means and multiple k-d trees. Although efficient, sub-optimal results are achieved.

For all the aforementioned tree partitioning methods, another major disadvantage lies in their heavy demand for memory. On one hand, in order to support fast comparison, all the candidate vectors are loaded into the memory. On the other hand, the tree nodes that are used for indexing also take up a considerable amount of extra memory. Overall, the memory consumption is usually several times bigger than the size of the candidate set.

Apart from tree partitioning methods, several attempts have been made to apply locality sensitive hashing (LSH) [3] to NNS. In general, there are two steps involved in the search stage. Namely, step 1 collects the candidates that share the same or similar hash keys as the query; step 2 performs exhaustive comparison between the query and all these selected candidates to find out the nearest neighbor. Similar to FLANN, LSH is good for applications that only require an approximate nearest neighbor. Moreover, in order to support the fast comparison in step 2, the whole reference set is required to be loaded in the memory, which leads to a lot of memory consumption.

In recent years, the problem of NNS has been addressed by vector quantization [18]. The proposal of applying the product quantizer [7] to NNS opened up a new way to solve this decades-old problem. All the quantization based methods [19], [7], [20], [9], [13] share two things in common. Firstly, the candidate vectors are all compressed via vector quantization. This makes it easier than previous methods to hold the whole dataset in the memory. Secondly, NNS is conducted between the query and the compressed candidate vectors. The distance between the query and the candidates is approximated by the distance between the query and the vocabulary words that are used for quantization. Due to the heavy compression on the reference vectors, high recall at the top is hardly achievable with these methods.

Recently, the hill-climbing strategy has also been explored [4], [5], [6]. Basically, no sophisticated space partitioning is required in the original idea [4]. The NNS starts from a group of random seeds (random locations in the vector space). The iterative procedure traverses over the kNN graph by depth-first search. Guided by the kNN graph, the search procedure ascends closer to the true nearest neighbor in each round. It takes only a few rounds to converge. Since the procedure starts from random positions, it is likely to be trapped in local optima. To alleviate this issue, an intervention scheme is introduced in [5]. However, this method turns out to be sub-optimal. Recently, this issue has been addressed by indexing the reference set with multiple k-d trees, which is similar to FLANN [17]. When a query comes, the search traverses over these k-d trees. Potential candidates are collected as the seeds for the hill-climbing search. In this way, the search process could be faster and the possibility of being trapped in a local optimum is lower. Unfortunately, such indexing causes nearly a doubling of memory overhead to load the k-d trees, which makes it unscalable to large-scale search tasks.

The hill-climbing strategy is adopted in our search scheme due to its simplicity and efficiency. Different from [5], [6], the search procedure is directed by residue vector quantization (RVQ) [19]. Firstly, the candidate vectors are quantized by
RVQ. However, different from the compressional methods, the quantized codes are used only to build the inverted indexing. Candidate vectors sharing the same RVQ codes (indexing key) are organized into one inverted list. In comparison to [6], only 4 bytes are required to keep the index for one vector. When a query comes, the distances between the query and the RVQ codes are calculated, and the candidates residing in the lists of the top-ranked codes are taken as the seeds for the hill-climbing procedure. The inverted indexing built from RVQ codes plays a similar role to the k-d trees in [6], while requiring a very small amount of memory.

3 NNS BASED ON K-NN GRAPH CONSTRUCTION

3.1 k-NN Graph Construction with Two Means Clustering

In this section, a novel kNN graph construction method based on the two means tree [21] is presented. Basically, we follow the strategy of divide-and-conquer [10], [11], [6]. Different from [10], [11], [6], the division over the vector space is undertaken by two means clustering in our paper. The motivation behind this algorithm is the observation that samples in one cluster are most likely neighbors. The general procedure that constructs the kNN graph is in two steps. Firstly, the reference set is divided into small clusters, where the size of each cluster is no bigger than a threshold (e.g., 50). This could be efficiently fulfilled by bisecting the reference set repeatedly with two means clustering. In the second step, an exhaustive distance comparison is conducted within each cluster. The kNN list for each cluster member is thereby built or updated. This general procedure could be repeated for several rounds to refine the kNN list of each sample.

Algorithm 1. kNNG construction
 1: Input: Xn×d: reference set, k: scale of k-NN graph
 2: Output: kNN graph Gn×k
 3: Initialize Gn×k; t ← 0;
 4: for t < I0 do
 5:   Apply two means clustering on Xn×d
 6:   Collect cluster set S
 7:   for each Sm ∈ S do
 8:     for each <i, j> do  // (i, j ∈ Sm & i ≠ j)
 9:       if <i, j> is NOT visited then
10:         Update G[i] and G[j] with d(xi, xj);
11:       end if
12:     end for
13:   end for
14:   t ← t + 1
15: end for
end

As shown in Alg. 1, the kNN graph construction mainly relies on two means clustering, which forms a partition of the input reference set in each round. Compared with the k-d tree, two means produces ball-shaped clusters instead of cubic-shaped partitions. In order to achieve high quality, the clustering is driven by the objective function of boost k-means [22], which always leads to higher clustering quality. With the clustering result of each round, the brute-force comparison is conducted within each cluster. Due to the small scale of each cluster, this process could be very efficient. The kNN graph G is refined as new and closer neighbors join in. Notice that the result of two means clustering is different from one round to another. We find it is sufficient to set I0 to 10 for million-level datasets. The overall complexity of this procedure is O(I0·n·log(n)·D), where n·log(n)·D is actually the complexity of two means clustering.
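A minimal sketch of this construction (Alg. 1) is given below, assuming NumPy and scikit-learn are available. The function names, the use of plain 2-means from scikit-learn in place of the boost k-means objective of [22], and the flat list-of-lists graph representation are our own simplifications, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisect(indices, X, max_size=50, clusters=None):
    """Recursively split `indices` with 2-means until every cluster is small."""
    if clusters is None:
        clusters = []
    if len(indices) <= max_size:
        clusters.append(indices)
        return clusters
    labels = KMeans(n_clusters=2, n_init=1).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:   # degenerate split; stop here
        clusters.append(indices)
        return clusters
    bisect(left, X, max_size, clusters)
    bisect(right, X, max_size, clusters)
    return clusters

def build_knn_graph(X, k=10, rounds=10, max_size=50):
    """Approximate kNN graph: repeated 2-means partitions + in-cluster brute force."""
    n = X.shape[0]
    # graph[i] holds up to k (distance, j) pairs, kept sorted by distance.
    graph = [[] for _ in range(n)]
    for _ in range(rounds):                          # I0 rounds in Alg. 1
        for cluster in bisect(np.arange(n), X, max_size):
            # Exhaustive pairwise distances inside one small cluster.
            D = np.linalg.norm(X[cluster, None, :] - X[None, cluster, :], axis=2)
            for a, i in enumerate(cluster):
                for b, j in enumerate(cluster):
                    if a < b:
                        graph[i].append((D[a, b], j))
                        graph[j].append((D[a, b], i))
            for i in cluster:                        # keep only the k closest so far
                graph[i] = sorted(set(graph[i]))[:k]
    return graph
```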
3.2 Enhanced Hill-climbing Procedure for NNS

Once the k-NN graph is ready, it is quite natural to adopt the hill-climbing scheme [4] to perform the NNS. However, we find that the procedure proposed in [4] is easily trapped in local optima and involves redundant visits to unlikely NN candidates. For this reason, the procedure is revised slightly in this paper, as given in Alg. 2.

Algorithm 2. Enhanced Hill-Climbing NNS
 1: Input: q: query, G: k-NN graph,
 2:        S: seeds, R ← φ: NN rank-list
 3: Output: R
 4: R ← R ∪ {<s, d(q, s)>}, s ∈ S; t ← 0;
 5: while t < t0 do
 6:   T ← φ;
 7:   for each ri ∈ top-k of R do
 8:     for each nj ∈ G[ri] do
 9:       if nj is NOT visited then
10:         T ← T ∪ {<nj, d(q, nj)>};
11:       end if
12:     end for
13:   end for
14:   R ← R ∪ T;
15:   t ← t + 1;
16: end while
end

In Alg. 2, the seeds could be randomly selected as [4] does, or supplied by the inverted indexing that will be introduced in Section 4. As will be seen in the experiments, seeds supplied by inverted indexing lead to a significant performance boost.

As shown in Alg. 2, the major modification to the original algorithm is in the way of the kNN expansion. As seen from Lines 8-14, the NN expansion is undertaken for each top-k sample in one iteration. In contrast, the algorithm given in [4] expands the kNN list only for the top-ranked sample in one iteration, which does not allow all the seeds to hold an equal chance to climb the hill. We find that the original procedure is more likely to be trapped in local optima. As will be seen in the experiments, more iterations are needed to reach the same level of search quality as our enhanced version.
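The enhanced procedure of Alg. 2 can be rendered roughly as follows. This is an illustrative sketch under our own assumptions (the graph is a list of neighbor-ID lists, distances are l2, and the rank list R is kept as a plain sorted list rather than a bounded priority queue); it is not the authors' code.

```python
import numpy as np

def enhanced_hill_climbing(q, X, graph, seeds, top_k=10, iters=10):
    """Alg. 2: expand the kNN lists of ALL current top-k entries per iteration.

    q:     (D,) query vector
    X:     (n, D) reference set
    graph: graph[i] = list of neighbor IDs of vector i (the kNN graph)
    seeds: iterable of seed vector IDs
    Returns the rank list R as (distance, id) pairs, best first.
    """
    visited = set(seeds)
    R = sorted((np.linalg.norm(q - X[s]), s) for s in seeds)
    for _ in range(iters):                      # t0 iterations
        T = []
        for _, ri in R[:top_k]:                 # every top-k entry is expanded
            for nj in graph[ri]:
                if nj not in visited:
                    visited.add(nj)
                    T.append((np.linalg.norm(q - X[nj]), nj))
        if not T:                               # nothing new was reached: converged
            break
        R = sorted(R + T)
    return R
```

With the graph produced by the construction sketch above, one could pass `[[j for _, j in lst] for lst in graph]` as the neighbor-ID lists.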
4 FAST NEAREST NEIGHBOR SEARCH

4.1 Review on Residue Quantization

The idea of approximating a vector by the composition of several vectors can be traced back to the design of "two-stage residue vector quantization" [23]. In this scheme, an input vector is encoded by a quantizer, and its residue (quantization error) is sequentially encoded by another quantizer. This scheme has been extended to several orders (stages) [14], in which the residue vector left from the prior quantization stage is quantized recursively. Given a vector v ∈ R^D and a series of vocabularies {V1, ···, Vk}, vector v is approximated by a composition of words from these vocabularies. In particular, one word is selected from each stage vocabulary. Given that this recursive quantization is repeated to the k-th stage, vector v is approximated as follows.

    v ≈ w^{(1)}_{i_1} + ··· + w^{(m)}_{i_m} + ··· + w^{(k)}_{i_k}    (1)

In Eqn. 1, w^{(m)}_{i_m} ∈ Vm is collected from vocabulary Vm, which is trained on the residues r_m = v − Σ_{j=1}^{m−1} w^{(j)}_{i_j} left over from stage m−1. Finally, vector v after RVQ is encoded as c1···ci···cm, where ci is the word ID from vocabulary Vi. When only one stage is employed, residue quantization is nothing more than vector quantization, which is well known as the bag-of-visual-words model [24].

As discussed in [19], multiple-stage RVQ forms a coarse-to-fine partition over the vector space, which plays a similar role as the hierarchical k-means (HKM) in FLANN [17]. Different from HKM, the space complexity of RVQ is very small, given that there is no need to keep a big amount of high-dimensional tree nodes. In addition, RVQ is able to produce a big amount of distinctive codes with small vocabularies when multiple stages are employed. For instance, given m = 4 and |Vi=1···4| = 256, the number of distinctive codes is as much as 2^32.

Fig. 1: An illustration of inverted indexing derived from residue vector quantization. Only the two-layer case is demonstrated, while it is extensible to more than two layers. RVQ is shown on the left, the inverted indexing with RVQ codes is shown on the right. Vector vid is encoded by w^{(1)}_{c_1} and w^{(2)}_{c_2}, which are the words from the first and the second layer vocabularies respectively; the concatenated codes c1c2 serve as the index key.
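As an illustration of the multi-stage encoding above, the sketch below trains one vocabulary per stage on the residues of the previous stage and encodes a vector as the code tuple (c1, ..., cm). It is a simplified rendering under our own naming, with plain k-means from scikit-learn standing in for whatever vocabulary training is used in practice.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rvq(X, stages=2, vocab_size=256):
    """Train one vocabulary per stage on the residues of the previous stage."""
    vocabs, residue = [], X.astype(np.float64)
    for _ in range(stages):
        km = KMeans(n_clusters=vocab_size, n_init=1).fit(residue)
        vocabs.append(km.cluster_centers_)
        residue = residue - km.cluster_centers_[km.labels_]   # pass residues on
    return vocabs

def rvq_encode(v, vocabs):
    """Encode v as word IDs (c1, ..., cm); v is approximated by the sum of the words."""
    codes, residue = [], v.astype(np.float64)
    for V in vocabs:
        c = int(np.argmin(np.linalg.norm(V - residue, axis=1)))
        codes.append(c)
        residue = residue - V[c]
    return tuple(codes)

def rvq_decode(codes, vocabs):
    """Approximate reconstruction: sum of the selected words from each stage."""
    return sum(V[c] for c, V in zip(codes, vocabs))
```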
4.2 Inverted Indexing with RVQ and Cascaded Pruning

In RVQ encoding, the energy of a vector is mainly kept in the codes of lower order. Similar to [25], [19], in our design the codes from multi-stage RVQ are combined as the indexing key, which indicates the rough location of a vector in the space, such that all the vectors that reside close to this location share the same indexing key. The more orders of codes are incorporated, the more precise this location and the larger the inverted indexing space. Consequently, the encoded vectors are sparsely distributed in the inverted lists.

When a query comes, the search process calculates the distance between the query and the indexing keys. To do so, the indexing keys are decomposed back into their RVQ codes. The asymmetric distances between the query and these codes are then calculated and sorted. Given query q and an indexing key I formed by the first two orders of RVQ codes (i.e., I = c1c2), the distance between q and key I is given as

    d(q, I) = (q − w^{(1)}_{c_1} − w^{(2)}_{c_2})^2
            = q·q^t − 2·(w^{(1)}_{c_1} + w^{(2)}_{c_2})·q^t + (w^{(1)}_{c_1} + w^{(2)}_{c_2})^2,    (2)

where w^{(1)}_{c_1} and w^{(2)}_{c_2} are words from the first order and the second order vocabularies respectively.

At the initial stage of the search, the inner products between the query and all the vocabulary words are calculated and kept in a table. In addition, the third term in Eqn. 2 involves the calculation of the inner product between w^{(1)}_{c_1} and w^{(2)}_{c_2}. In order to facilitate fast calculation, the inner products between words from the first order vocabulary and words from the second order vocabulary are pre-calculated and kept for look-up. As a result, fast calculation of Eqn. 2 is achievable by performing several look-up operations.

In order to speed up the search, it is more favorable to calculate the distance between the query and the first order codes first. Thereafter, early pruning is undertaken by ignoring keys whose first order codes are far away from the query. To achieve this, Eqn. 2 is rewritten as

    d(q, I) = q·q^t − 2·w^{(1)}_{c_1}·q^t + w^{(1)2}_{c_1}                                (term 1)
              − 2·w^{(2)}_{c_2}·q^t + 2·w^{(1)}_{c_1}·w^{(2)t}_{c_2} + w^{(2)2}_{c_2}.    (3)  (term 2)

As shown in Eqn. 3, the search process calculates "term 1" first. Only keys that are ranked at the top join the distance calculation of the second term. Based on d(q, I), the indexing keys are sorted again. Only the inverted lists that are pointed to by the top-ranked keys are visited. Note that this cascaded pruning scheme is extensible to the case where more than two orders of codes are used as the indexing key. In each inverted list, only the IDs of the vectors which share the same RVQ codes are kept.

In the above analysis, we only show the case where the inverted indexing keys are produced from the first two orders of RVQ codes. However, it is extensible to the case where more orders of RVQ codes are employed. It is actually possible to generate a key space of trillion level with only four orders of RVQ codes. For instance, if the vocabulary sizes of the first four orders are set to 2^13, 2^8, 2^8 and 2^8, the volume of the key space is as big as 2^37, which is sufficient to build inverted indices for a trillion-level reference set. In practice, the number of layers to be employed is closely related to the size of the reference set.

Intuitively, the larger the reference set, the more layers should be incorporated. According to our observation, for million-level datasets, two-layer RVQ makes a good trade-off between efficiency and search quality. During the online search, all the candidate vectors that reside in the top-ranked inverted lists are collected as the candidate seeds. The seeds are organized as a priority queue according to d(q, I) and are fed to Alg. 2.
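A sketch of how the key ranking of Eqn. 3 and the seed collection could be organized is given below for the two-layer case. The table layout, the parameter names (keep_c1, keep_keys, limit) and the dictionary-based inverted lists are assumptions on our part, not the paper's data structures.

```python
import numpy as np

def rank_keys(q, V1, V2, keep_c1=8, keep_keys=32):
    """Rank two-layer RVQ index keys by d(q, I) with the cascade of Eqn. 3.

    V1: (K1, D) first-layer vocabulary, V2: (K2, D) second-layer vocabulary.
    Returns the keep_keys best (c1, c2) keys, closest first.
    """
    qq = float(q @ q)
    # Look-up tables: query/word inner products and first/second-layer cross products.
    q_V1, q_V2 = V1 @ q, V2 @ q                     # (K1,), (K2,)
    V1_V2 = V1 @ V2.T                               # (K1, K2), precomputable offline
    n1, n2 = np.sum(V1 ** 2, axis=1), np.sum(V2 ** 2, axis=1)

    term1 = qq - 2.0 * q_V1 + n1                    # ||q - w1||^2 for every c1
    best_c1 = np.argsort(term1)[:keep_c1]           # early pruning on first-order codes

    scored = []
    for c1 in best_c1:
        term2 = -2.0 * q_V2 + 2.0 * V1_V2[c1] + n2  # second-order completion
        for c2, d in enumerate(term1[c1] + term2):
            scored.append((d, int(c1), c2))
    scored.sort()
    return [(c1, c2) for _, c1, c2 in scored[:keep_keys]]

def collect_seeds(keys, inverted_lists, limit=200):
    """Gather candidate IDs from the top-ranked inverted lists as seeds for Alg. 2."""
    seeds = []
    for key in keys:                                # inverted_lists: {(c1, c2): [ids]}
        seeds.extend(inverted_lists.get(key, []))
        if len(seeds) >= limit:
            break
    return seeds[:limit]
```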
5 EXPERIMENT

In this section, the experiments on SIFT1M and GIST1M are presented. In SIFT1M and GIST1M, there are 10,000 and 1,000 queries respectively. In the first two experiments, following the practice in [6], the curve of average recall against total search time (of all queries) is presented to show the trade-off that one method could achieve between search quality and search efficiency. The curve is plotted by varying the number of samples to be expanded in each round of the top-ranked list and the number of iterations.

We mainly study the performance of the enhanced GNNS (E-GNNS) and the performance of E-GNNS when it is supported by inverted indexing (IVF+E-GNNS). Their performance is presented in comparison to state-of-the-art methods such as the original GNNS [4], EFANNA [6], which is the most effective method in recent studies, FLANN [17], E2LSH [3] and ANN [26]. For E-GNNS and IVF+E-GNNS, we use the same scale of kNN graphs as EFANNA in the experiments, namely 30 and 50 for SIFT1M and GIST1M respectively. For IVF+E-GNNS, two layers of RVQ quantizer are adopted and their vocabulary sizes are set to 256 each on both SIFT1M and GIST1M. All the experiments have been carried out by a single thread on a PC with a 3.6GHz CPU and 16G of memory.
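For reference, the recall@k quality measure reported below (R@1 and R@100 in Table 2) can be computed as in the following sketch; the ground-truth format and the names are our own assumption.

```python
import numpy as np

def recall_at_k(results, ground_truth, k=1):
    """Fraction of queries whose true nearest neighbor appears in the top-k results.

    results:      (num_queries, >=k) ranked candidate IDs returned by a method
    ground_truth: (num_queries,) ID of the true nearest neighbor of each query
    """
    hits = sum(gt in row[:k] for row, gt in zip(results, ground_truth))
    return hits / len(ground_truth)
```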
5.1 E-GNNS versus GNNS

In our first experiment, we show the performance improvement achieved by E-GNNS over GNNS. As shown in Figure 2, with the revised procedure (Alg. 2), a significant performance boost is observed with E-GNNS. On one hand, E-GNNS takes fewer iterations to converge. On the other hand, it becomes less likely to be trapped in a local optimum, as it is able to reach a considerably higher recall.

Fig. 2: NNS quality against search time for GNNS and E-GNNS on (a) SIFT1M and (b) GIST1M. The scale of the kNN graph is fixed to 30 and 50 for SIFT1M and GIST1M respectively.

5.2 E-GNNS supported by Inverted Index

In the second experiment, we study the performance of E-GNNS when it is supported by the inverted indexing derived from RVQ. The performance of EFANNA, which similarly follows the scheme of GNNS, is also presented for comparison. Compared to EFANNA, we adopt inverted indexing instead of multiple k-d trees to direct the search. Moreover, the enhanced GNNS procedure is adopted in our case. As shown in Figure 3, E-GNNS supported by inverted indexing shows considerable performance improvement. With similar search time, its average recall is boosted by more than 10% across the two datasets. Compared to EFANNA, IVF+E-GNNS converges to significantly higher recall in shorter time. EFANNA takes shorter time than IVF+E-GNNS only when the requirement on search quality is low. In order to reach high recall, its search procedure could be unexpectedly elongated. We find that its performance could even degenerate as one chooses a larger expansion factor (comparable to top-k in IVF+E-GNNS). This mainly indicates that GNNS directed by multiple k-d trees is more easily trapped in local optima than IVF+E-GNNS. Notice that IVF+E-GNNS also outperforms the method proposed in [5] by a large margin, as reported in [6].

Fig. 3: NNS quality against search time for E-GNNS, IVF+E-GNNS and EFANNA on (a) SIFT1M and (b) GIST1M. The scale of the kNN graphs in SIFT1M and GIST1M is fixed to 30 and 50 respectively for all the methods.

TABLE 1: Parameter settings for non-compressional methods

              SIFT1M                 GIST1M
ANN [26]      ε=0                    ε=5
E2LSH [3]     r=0.45                 r=1
FLANN [17]    p=0.9, ωm=0.01, ωt=0   p=0.8, ωm=0.01, ωt=0
EFANNA [6]    nTree=4, kNN=30        nTree=4, kNN=50

5.3 Comparison to Non-compressional Methods

In this experiment, the performance of IVF+E-GNNS is compared to both compressional methods such as IVFPQ [7], IVFRVQ [19] and IMI [27], and non-compressional methods such as E2LSH [3], ANN [26] and FLANN [17]. The speed efficiency, search quality and memory cost of all these methods on SIFT1M and GIST1M are considered and shown in Table 2. In particular, the memory cost is shown as the ratio between the memory consumption of one method and the size of the reference set. All results reported for the non-compressional methods are obtained by running the codes provided by the authors. The configuration of each method is set to where a balance is made between speed efficiency and search quality (see detailed configurations in Table 1). The performance of exhaustive NNS is supplied for reference.

As seen from Table 2, ANN shows stable and the best search quality on the two datasets in terms of recall; however, its speed is only several times faster than brute-force search. Among all the methods considered in this experiment, IVF+E-GNNS shows the best trade-off between memory cost, speed efficiency and search quality. Additionally, compared with all the other non-compressional methods, its memory consumption is only slightly larger than the size of the reference set.
TABLE 2: Comparison on average time cost per query (ms), memory consumption (relative ratio) and quality (recall@top-k) of all representative NNS methods in the literature. In terms of memory consumption, the ratio is taken between the real memory consumption of one method and the original size of the candidate dataset.

                          SIFT1M                              GIST1M
                 T(ms)    Memory   R@1    R@100      T(ms)    Memory    R@1    R@100
IVFPQ-8 [7]        8.6    0.086×   0.288  0.943       36.2    0.0029×   0.082  0.510
IVFRVQ-8 [19]      4.7    0.070×   0.270  0.931       12.5    0.0023×   0.107  0.616
IMI-8 [27]        95.1    0.086×   0.228  0.924       90.0    0.0029×   0.053  0.369
FLANN [17]         0.9    6.25×    0.852  0.852       11.9    2.12×     0.799  0.799
ANN [26]         250.5    8.45×    0.993  1.000      232.0    1.48×     0.938  1.000
E2LSH [3]         30.0    29.10×   0.764  0.764     6907.7    1.74×     0.342  0.990
E-GNNS             2.3    1.225×   0.882  0.882        6.7    1.049×    0.724  0.724
IVF+E-GNNS         1.8    1.234×   0.983  0.983        8.0    1.053×    0.943  0.943
EFANNA [6]         1.8    1.621×   0.955  0.955        9.5    1.049×    0.891  0.891
Exhaustive NNS   1,107.0  1.0×     1.00   1.00      2,258.0   1.0×      1.00   1.00

6 CONCLUSION

We have presented our solution for efficient nearest neighbor search, which is based on the hill-climbing strategy. Two major issues have been discussed in the paper. Firstly, a novel kNN graph construction method is proposed. Its time complexity is only O(n·log(n)·D), which makes it scalable to very large datasets. With the support of the constructed kNN graph, an enhanced hill-climbing procedure has been presented. Furthermore, with the support of inverted indexing, it shows superior performance over most of the NNS methods in the literature. We also show that this method is easily scalable to billion-level datasets with the support of the inverted indexing derived from multi-layer residue vector quantization.

REFERENCES

[1] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, pp. 509–517, Sep. 1975.
[2] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, vol. 14, (New York, NY, USA), pp. 47–57, ACM, Jun. 1984.
[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, (New York, NY, USA), pp. 253–262, ACM, 2004.
[4] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, "Fast approximate nearest-neighbor search with k-nearest neighbor graph," in International Joint Conference on Artificial Intelligence, pp. 1312–1317, 2011.
[5] J. Wang and S. Li, "Query-driven iterated neighborhood graph search for large scale indexing," in Proceedings of the 20th ACM International Conference on Multimedia, (New York, NY, USA), pp. 179–188, ACM, 2012.
[6] C. Fu and D. Cai, "EFANNA: An extremely fast approximate nearest neighbor search algorithm based on knn graph," arXiv.org, 2016. arXiv:1609.07228.
[7] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," Trans. PAMI, vol. 33, pp. 117–128, Jan. 2011.
[8] A. Babenko and V. Lempitsky, "Additive quantization for extreme vector compression," in CVPR, pp. 931–938, 2014.
[9] T. Zhang, C. Du, and J. Wang, "Composite quantization for approximate nearest neighbor search," in ICML, pp. 838–846, 2014.
[10] J. L. Bentley, "Multidimensional divide-and-conquer," Commun. ACM, vol. 23, pp. 214–229, Apr. 1980.
[11] J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li, "Scalable k-nn graph construction for visual descriptors," in CVPR, pp. 1106–1113, 2012.
[12] W. Dong, C. Moses, and K. Li, "Efficient k-nearest neighbor graph construction for generic similarity measures," in Proceedings of the 20th International Conference on World Wide Web, WWW '11, (New York, NY, USA), pp. 577–586, ACM, 2011.
[13] J. Martinez, H. H. Hoos, and J. J. Little, "Stacked quantizers for compositional vector compression," arXiv.org. https://arxiv.org/abs/1411.2173.
[14] C. F. Barnes, S. A. Rizvi, and N. M. Nasrabadi, "Advances in residual vector quantization: a review," IEEE Transactions on Image Processing, vol. 5, no. 2, pp. 226–262, 1996.
[15] D. Comer, "Ubiquitous B-tree," ACM Computing Surveys, vol. 11, pp. 121–137, Jun. 1979.
[16] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: an efficient and robust access method for points and rectangles," in International Conference on Management of Data, pp. 322–331, 1990.
[17] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," Trans. PAMI, vol. 36, pp. 2227–2240, 2014.
[18] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Transactions on Information Theory, vol. 44, pp. 2325–2383, Sep. 2006.
[19] Y. Chen, T. Guan, and C. Wang, "Approximate nearest neighbor search by residual vector quantization," Sensors, vol. 10, pp. 11259–11273, 2010.
[20] A. Babenko and V. Lempitsky, "Efficient indexing of billion-scale datasets of deep descriptors," in CVPR, pp. 2055–2063, 2016.
[21] N. Verma, S. Kpotufe, and S. Dasgupta, "Which spatial partition trees are adaptive to intrinsic dimension?," in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 565–574, 2009.
[22] W.-L. Zhao, C.-H. Deng, and C.-W. Ngo, "Boost k-means," arXiv.org, vol. abs/1610.02483, 2016. https://arxiv.org/abs/1610.02483.
[23] R. M. Gray, "Vector quantization," IEEE Acoust., Speech, Signal Processing Magazine, pp. 4–29, Apr.
[24] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in ICCV, Oct. 2003.
[25] A. Babenko and V. Lempitsky, "The inverted multi-index," in CVPR, pp. 3069–3076, 2012.
[26] S. Arya and D. Mount, "ANN Searching Toolkit." ftp://ftp.cs.umd.edu/pub/faculty/mount/ANN/.
[27] A. Babenko and V. Lempitsky, "The inverted multi-index," Trans. PAMI, vol. 37, no. 6, pp. 1247–1260, 2015.