Scalable Multi-Database Privacy-Preserving Record
Linkage using Counting Bloom Filters
Dinusha Vatsalan, Peter Christen, and Erhard Rahm†
Research School of Computer Science, The Australian National University
Canberra ACT 0200, Australia
†Universität Leipzig, Institut für Informatik, 04109, Leipzig, Germany
{dinusha.vatsalan, peter.christen}@anu.edu.au,†[email protected]
arXiv:1701.01232v1 [cs.DB] 5 Jan 2017

ABSTRACT
Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Scaling PPRL to more databases (multi-party PPRL) is an open challenge since privacy threats as well as the computation and communication costs for record linkage increase significantly with the number of databases. We thus propose the use of a new encoding method of sensitive data based on Counting Bloom Filters (CBF) to improve privacy for multi-party PPRL. We also investigate optimizations to reduce communication and computation costs for CBF-based multi-party PPRL with and without the use of a dedicated linkage unit. Empirical evaluations conducted with real datasets show the viability of the proposed approaches and demonstrate their scalability, linkage quality, and privacy protection.

Keywords
Record linkage, similarity, privacy, multi-party, communication patterns, secure summation

1. INTRODUCTION
A wide range of real-world applications, including in healthcare, government services, crime and fraud detection, national security, and businesses, require person-related data from multiple sources held by different organizations to be integrated or linked. Integrated data can then be used for data mining and analytics to empower efficient and quality decision making with rich data. Integrating data helps improve the quality of data by identifying and resolving conflicts in data values, enriching data with additional detailed information, and dealing with missing values [2].

The analysis and mining of data integrated across organizations can be used, for example, in health outbreak systems that allow the early detection of infectious diseases before they spread widely around a country or even worldwide. Such an application requires data to be integrated across several sources, including human health data, travel data, consumed drug data, and even animal health data [6]. A second contemporary motivating example is national security applications that integrate data from law enforcement agencies, Internet service providers, businesses, as well as financial institutions to enable the accurate identification of crime and fraud, or of terrorism suspects [25].

In the absence of unique entity identifiers in the databases that are to be linked, personal identifying attributes (such as names and addresses) need to be used for the linkage. Known as quasi-identifiers (QIDs) [41], such attribute values are in general assumed to be sufficiently well correlated with entities to allow accurate linkage. Using such personal information across different organizations, however, often leads to privacy and confidentiality concerns. This problem has been addressed through the development of 'privacy-preserving record linkage' (PPRL) [43] techniques. PPRL aims to conduct linkage using only masked (encoded) QIDs without requiring any sensitive or confidential information to be exchanged and revealed between the organizations involved in the linkage. Generally, masking is conducted on QIDs to transform the original values such that a specific functional relationship exists between the original and the masked values [41]. While there have been many different approaches proposed for PPRL (as reviewed in [43]), most work thus far has concentrated on linking records from only two sources (or parties). As the healthcare and national security examples described above show, linking data from several sources is however commonly required in practical applications.

The drawback of the small number of existing PPRL solutions that can link data from multiple parties is that either (1) they only support exact matching (which classifies sets of records as matches if their masked QIDs are exactly the same and as non-matches otherwise) [14, 17] or (2) they are applicable to QIDs of categorical data only [12, 22]. However, in many PPRL applications QIDs of string data, such as names and addresses, are required. These QIDs often contain errors and variations, which necessitate the use of approximate comparison functions (that are computationally expensive in terms of the number of comparisons) for comparing QIDs. In this paper, we tackle the multi-party PPRL problem by developing an efficient privacy-preserving approach for approximate matching of (masked) QIDs of string data from multiple records.

This is an extended version of an article published in IEEE International Conference on Data Mining (ICDM) International Workshop on Privacy and Discrimination in Data Mining (PDDM) 2016 [42].
[Figure 1: two panels, each showing databases $D_1$ to $D_4$ of parties $p_1$ to $p_4$, the candidate record sets, and the comparison at the Linkage Unit (LU).]
Figure 1: An overview of traditional naïve comparison of candidate record sets masked using BFs (left) and CBFs (right, as will be described in detail in Section 3) from $p = 4$ parties using a LU. Databases are indexed/blocked to reduce the number of candidate record sets such that only records in the same blocks are compared and classified. Blocks are illustrated by different patterns in $D_i$, with $1 \le i \le p$. $n$ denotes the number of records in the databases (assuming all databases are of equal size) and $b$ denotes the number of blocks generated (assuming blocks of equal size). Independent of the masking function and the communication pattern used (i.e. BFs and direct one-to-one communication in the left figure, and CBFs and ring-based communication in the right figure), the naïve approach results in exponential complexity of $b \cdot (n^4/b^4)$ candidate sets each consisting of $p = 4$ records (one from each party).
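The candidate-set counts quoted in the Figure 1 caption can be checked numerically. The following is a minimal sketch, assuming equal-sized databases and equal-sized blocks as the caption does; the function names and the sizes used in the demo are hypothetical and chosen only for illustration.

```python
def naive_candidate_sets(n: int, p: int) -> int:
    """All-to-all record sets across p databases of n records each: n^p."""
    return n ** p


def blocked_candidate_sets(n: int, p: int, b: int) -> int:
    """Candidate sets after blocking into b equal-sized blocks: b * (n/b)^p."""
    if n % b != 0:
        raise ValueError("this sketch assumes that b divides n evenly")
    return b * (n // b) ** p


if __name__ == "__main__":
    n, b = 1000, 100  # hypothetical sizes, not taken from the paper
    for p in (2, 3, 4):
        print(p, naive_candidate_sets(n, p), blocked_candidate_sets(n, p, b))
```

Blocking lowers the count by a factor of $b^{p-1}$, but the remaining number of candidate sets still grows exponentially with $p$.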
We propose the use of Counting Bloom Filter (CBF) encoding, which is a variation of Bloom filter (BF) encoding [43], to enable efficient and approximate privacy-preserving linkage of multiple databases. BFs are bit vectors into which values are hash-mapped using hash functions (as we describe in Section 2.1). CBFs, on the other hand, are integer vectors that contain count values in each position. Multiple BFs can be summarized as a single CBF using the vector addition operation between BFs. Previous BF encoding-based PPRL approaches [8, 33] suggest using a linkage unit (LU), which is a dedicated external party that can perform linkage by comparing candidate record sets (masked into BFs) from all database owners and calculating their similarities to classify them as matches or non-matches. Our hypothesis is that, rather than sending BFs from all parties to a linkage unit (LU) to calculate their similarity, a single CBF for each candidate record set generated over $p$ parties can be used to calculate their similarity, as illustrated in Figure 1 (left for BFs and right for CBFs). Since CBFs contain only the summary information (count values) of multiple records in their positions rather than the actual individual bit values of a single record as in BFs, they provide increased privacy compared to BFs, as we discuss in Section 6. To the best of our knowledge, this privacy aspect of CBFs has so far not been utilized in PPRL.

An additional challenge with multi-party PPRL is that complexity increases significantly with multiple parties in terms of both computational efforts and communication volume. A basic approach would be to send all masked records from all $p$ parties to a LU that can calculate pair-wise similarities between masked records, which is of $O(p^2 \cdot n^2)$ complexity, where $n$ is the size of databases assuming all databases are of equal size. However, identifying a matching set of records from all $p$ parties is not possible with such a basic pair-wise comparison approach. On the other hand, the number of naïve (all-to-all) comparisons between records required across all $p$ databases ($D_1, D_2, \cdots, D_p$) is equal to the product of the size of the databases (i.e. $n \times n \times \cdots \times n = n^p$). Addressing this complexity challenge, two-step algorithms have been developed where in the first step a private indexing/blocking technique is used to reduce the number of candidate record sets from $n^p$ to $b \cdot n^p / b^p$, assuming $b$ blocks of equal size. In the second step only these candidate record pairs have to be compared and classified [43]. Compared to the quadratic number of record pairs when linking only two databases ($O(n^2)$), in multi-party PPRL the number of candidate record sets increases exponentially with the number of parties ($O(n^p)$), and thus using existing private blocking techniques would not sufficiently reduce the number of comparisons, as has been empirically studied in several recent approaches [26, 27, 39].

Figure 1 overviews the naïve computation and communication of (masked) candidate record sets from multiple parties ($p = 4$) using a LU. Independent of the used masking function (BFs in the left figure and CBFs in the right figure) and the communication pattern (direct one-to-one communication between each party and the LU in the left figure and ring-based communication among the parties in the right figure, as will be described in Section 2), the naïve approach results in exponential complexity. Efficient communication patterns and advanced filtering approaches for multi-party PPRL therefore need to be developed in order to reduce the potentially huge number of comparisons. Moreover, with multiple parties the privacy risk of collusion increases, where a sub-set of parties collude among them in order to learn about another party's (or sub-set of parties') private data. Both examples for the naïve method described in Figure 1 are highly susceptible to collusion.

In order to overcome these scalability and privacy challenges of multi-party PPRL, we introduce two efficient CBF-based communication patterns that either use a LU or operate symmetrically without a LU (where a trusted external party is not available to act as a LU). The proposed approaches can significantly reduce the number of comparisons required between records in contrast to the naïve all-to-all comparisons, and thereby improve the scalability while also improving the privacy (reducing the likelihood of collusions) by arranging parties into several groups and by distributing computations among parties.

Contributions: Our contributions in this paper are: (1) a novel multi-party PPRL protocol based on CBFs and secure summation for efficient, approximate, and private linkage; (2) two variations of extended secure summation protocols for improved privacy against collusion among the database owners: (a) homomorphic encryption-based and (b) salting-based (using random seed integers); (3) two efficient
communication patterns (with and without a LU) for reducing the comparison space and risk of collusion between parties and thereby improving scalability and privacy, respectively, in multi-party PPRL; (4) an analysis of the protocol in terms of the three properties of PPRL: scalability (complexity), linkage quality, and privacy; and (5) an empirical evaluation and comparison of our protocol with two baseline approaches using large North Carolina Voter Registration (NCVR) [4] datasets.

Outline: In the following section we describe the preliminaries. In Section 3 we propose our protocol for multi-party PPRL based on CBFs and secure summation, in Section 4 we propose two extended secure summation protocols to improve privacy of our approach, and in Section 5 we introduce two efficient communication patterns to improve scalability and privacy. We analyze our protocol in terms of complexity, linkage quality, and privacy in Section 6, and in Section 7 we conduct an empirical study on the NCVR datasets to validate these analyses. We provide a review of related work in Section 8. Finally, we summarize and discuss future research directions in Section 9.

2. PRELIMINARY CONCEPTS AND BUILDING BLOCKS
In this section, we define the problem of multi-party PPRL and explain how CBFs can be used for efficiently calculating similarities (approximate matching) of QID values between a set of multiple (two or more) records (held by different parties) in PPRL.

We assume $p$ database owners $P_1, P_2, \cdots, P_p$ with their respective databases $D_1, D_2, \cdots, D_p$ (containing sensitive or confidential identifying information) participate in the process under the honest-but-curious (HBC) adversary model [43]. In the HBC model, parties are assumed to follow the protocol without deviating or sending false information while being curious to learn about other parties' data. However, the HBC model does not assume that the parties do not collude among them to learn about other parties' data [19]. We quantify the risk of collusion in multi-party PPRL and the reduction of risk by our communication patterns in Section 6.2. We also assume a set of QID attributes $A$, which will be used for the linkage, is common to all these databases. We formally define the problem of PPRL of multiple databases as follows.

Definition 2.1. Multi-party PPRL: Assume $P_1, \ldots, P_p$ are the $p$ owners (parties) of the databases $D_1, \ldots, D_p$, respectively. They wish to determine which of their records $R_{1,i} \in D_1$, $R_{2,j} \in D_2$, ..., $R_{p,k} \in D_p$ match based on the (masked) QIDs of these records according to a decision model $C(R_{1,i}, R_{2,j}, \ldots, R_{p,k})$ that classifies record sets $(R_{1,i}, R_{2,j}, \ldots, R_{p,k})$ into one of the two classes $M$ of matches and $U$ of non-matches. Assuming the HBC adversary model, parties $P_1, \ldots, P_p$ are honest, in that they follow the linkage protocol steps, while they do not wish to reveal their actual records $R_{1,i}, \ldots, R_{p,k}$ with any other party. They however are prepared to disclose to each other, or to an external party, the actual values of some selected attributes of the record sets that are in class $M$ to allow analysis.

Masking functions used for privacy-preserving algorithms can be categorized into two: cryptographic-based secure multi-party computation (SMC) techniques and perturbation-based techniques [41]. The former approach is generally more expensive with regard to the computation and communication complexities though it provides strong privacy guarantees and high accuracy [18]. The latter uses efficient techniques and, as opposed to SMC techniques, these techniques aim to hide (mask) information about the original values (to preserve privacy) while still allowing to perform approximate matching between the masked values using the functional relationship between original and masked data.

We propose an efficient protocol for multi-party PPRL using perturbation-based masking. In this section, we describe the four building blocks of our protocol, and in Section 3 we present our algorithm in detail. In the following two sections we assume a linkage unit (LU) is available to conduct the linkage, and in Section 5 we propose a variation where a LU is not required to conduct the linkage using our protocol.

2.1 Bloom filter encoding
Bloom filter (BF) encoding has been used as an efficient masking (encoding) technique in several PPRL solutions [9, 17, 26, 34, 35, 38, 39, 40]. A BF $b_i$ is a bit array data structure of length $l$ bits where all bits are initially set to 0. $k$ independent hash functions, $h_1, h_2, \ldots, h_k$, each with range $1, \ldots, l$, are used to map each of the elements $s$ in a set $S$ into the BF by setting the bit positions $h_j(s)$ with $1 \le j \le k$ to 1.

Schnell et al. [34] were the first to propose a method for approximate matching in PPRL of two databases using BFs. In their work, as in our protocol, the character q-grams (sub-strings of length $q$) of QID values in $A$ of each record in the databases to be linked are hash-mapped into a BF using $k$ independent hash functions. This method of BF encoding is known as Cryptographic Long term Key (CLK) encoding [34].

These BFs are then either sent to a LU that calculates the similarity of pairs of BFs, as suggested by Schnell et al. [34] and Durham et al. [9], or they are partially exchanged among the database owners to distributively calculate the similarities of BF pairs/sets, as proposed by Lai et al. [17] and Vatsalan and Christen for two-party [38] and multi-party [39, 40] approaches. Figure 2(a) illustrates the encoding of bigrams ($q = 2$) of two QID values 'peter' and 'pete' into $l = 14$ bits long BFs using $k = 2$ hash functions.

[Figure 2 content, q-grams pe, et, te, er ('peter') and pe, et, te ('pete'):
(a) BFs: $b_1$ = 1 0 1 0 0 0 1 1 1 0 0 1 0 1 ($x_1 = 7$); $b_2$ = 1 0 1 0 0 0 1 1 0 0 0 1 0 0 ($x_2 = 5$); $z = 5$; Dice_sim = $(2 \times 5)/(7 + 5) = 0.83$
(b) CBF: $c$ = 2 0 2 0 0 0 2 2 1 0 0 2 0 1; $z = 5$; $x_1 + x_2 = 12$; Dice_sim = $(2 \times 5)/12 = 0.83$]
Figure 2: Similarity calculation of QIDs of two records masked using (a) BFs and (b) CBFs, where $l = 14$, $k = 2$, and $q = 2$.

2.2 Dice coefficient
Any set-based similarity function (such as overlap, Jaccard, and Dice coefficient) can be used to calculate the similarity of pairs or sets of (multiple) BFs. The Dice coefficient
has been used for matching of BFs since it is insensitive to many matching zeros (bit positions to which no q-grams are hash-mapped) in long BFs [34]. In future, we aim to investigate how other approximate string comparison functions, such as edit distance [30] and Jaro and Winkler [10, 45], can be extended to calculate the similarity of more than two values.

Definition 2.2. We define the Dice coefficient similarity of $p$ BFs ($b_1, \cdots, b_p$) as:

$$\mathrm{Dice\_sim}(b_1, \cdots, b_p) = \frac{p \times z}{\sum_{i=1}^{p} x_i} \tag{1}$$

where $z$ is the number of common bit positions that are set to 1 in all $p$ BFs (common 1-bits), and $x_i$ is the number of bit positions set to 1 in $b_i$ (1-bits), $1 \le i \le p$.

Figure 2(a) illustrates the Dice coefficient similarity calculation of the two QID values 'peter' and 'pete' masked into BFs.

2.3 Secure summation
A secure summation protocol [7] can be used to securely sum the input values of multiple parties ($p \ge 3$), such that no party learns the individual values of other parties, but only the summed value. The input can either be a single numeric value or a vector of numeric values. For example, $p$ numeric values ($v_1, v_2, \cdots, v_p$) can be securely summed by using a random numeric value $r'$ which is sent by a LU. The first party $P_1$ that receives $r'$ calculates the summation $r' + v_1$ and sends it to $P_2$. This process is repeated until the last party sends $r' + v_1 + v_2 + \cdots + v_p$ to the LU, which then subtracts $r'$ from the summed value to calculate the summation of the $p$ values. The protocol employs a ring-based communication pattern over all parties which allows the LU to learn the final value ($\sum_{i=1}^{p} v_i$) but no party will learn the values $v_i$ of the other parties. This basic secure summation (BSS) protocol is susceptible to collusion risk among the parties. In Section 4 we propose extended secure summation protocols for improved privacy against collusion.

2.4 Counting Bloom filters (CBFs)
In this section we propose a novel method of calculating the similarity of $p$ BFs using a CBF, which will provide increased privacy compared to using BFs for similarity calculation, as we will discuss in Section 6.2. A CBF $c$ of $p$ ($p > 1$) BFs is an integer array data structure of the same length as the BFs ($l$). It contains the counts of values in each bit position $\beta$, $1 \le \beta \le l$, over a set of $p$ BFs in its corresponding position, such that $c[\beta] = \sum_{i=1}^{p} b_i[\beta]$, where $c[\beta]$ is the count value in the bit position $\beta$ of the CBF $c$ and $b_i[\beta] \in \{0, 1\}$ provides the value in the bit position $\beta$ of BF $b_i$. Given $p$ BFs (bit vectors) $b_i$ with $1 \le i \le p$, the CBF $c$ can be generated by applying a vector addition operation between the bit vectors such that $c = \sum_i b_i$.

Theorem 2.1. The Dice coefficient similarity of $p$ BFs can be calculated given only a CBF as:

$$\mathrm{Dice\_sim}(b_1, \cdots, b_p) = \frac{p \times |\{\beta : c[\beta] = p,\ 1 \le \beta \le l\}|}{\sum_{\beta=1}^{l} c[\beta]} \tag{2}$$

Proof. The Dice coefficient similarity of $p$ BFs ($b_1, b_2, \cdots, b_p$) is determined by the sum of 1-bits ($\sum_{i=1}^{p} x_i$) in the denominator of Eq. (1) and the number of common 1-bits ($z$) in all $p$ BFs in the numerator of Eq. (1). The number of 1-bits in a BF $b_i$ is $x_i = b_i[1] + b_i[2] + \cdots + b_i[l]$, with $1 \le i \le p$. The sum of 1-bits in all $p$ BFs is therefore $\sum_{i=1}^{p} x_i = \sum_{i=1}^{p} (b_i[1] + b_i[2] + \cdots + b_i[l])$. The value in a bit position $\beta$ ($1 \le \beta \le l$) of the CBF of these $p$ BFs is $c[\beta] = b_1[\beta] + b_2[\beta] + \cdots + b_p[\beta]$. The sum of values in all bit positions of the CBF is $\sum_{\beta=1}^{l} c[\beta] = \sum_{\beta=1}^{l} (b_1[\beta] + b_2[\beta] + \cdots + b_p[\beta])$, which is equal to $\sum_{i=1}^{p} x_i = \sum_{i=1}^{p} (b_i[1] + b_i[2] + \cdots + b_i[l])$. Further, if a bit position $\beta$ ($1 \le \beta \le l$) contains 1 in all $p$ BFs, i.e. $\forall_{i=1}^{p}\, b_i[\beta] = 1$, then $c[\beta] = \sum_{i=1}^{p} b_i[\beta] = p$. Therefore, the common 1-bits ($z$) that occur in all $p$ BFs can be calculated by counting the number of positions $\beta \in c$ where $c[\beta] = p$, while the sum of the number of 1-bits ($\sum_{i=1}^{p} x_i$) is calculated by summing the values in bit positions $\beta \in c$, $\sum_{\beta=1}^{l} c[\beta]$.

If the LU gets only the CBF $c$ that contains the summed values in the bit positions of $p$ BFs instead of the actual $p$ BFs, then the LU can calculate the Dice coefficient of $p$ BFs using Eq. (2) without learning any information about the individual bit positions of each party. As an example, the Dice coefficient calculation of the two BFs from the two parties $P_1$ and $P_2$ in Figure 2(a) is extended by using a CBF and secure summation to calculate the similarity by the LU, as shown in Figure 2(b). The information gained by the LU (and/or database owners) from a single CBF is less than from $p$ BFs (i.e. CBFs provide increased privacy compared to BFs), as theoretically proven in Section 6.2 and empirically validated in Section 7.

3. MULTI-PARTY PPRL ALGORITHM
Our protocol allows efficient, approximate, and private linking of records based on their masked QID values in multiple databases from $p$ ($\ge 2$) sources/parties. We first describe a basic naïve protocol based on CBFs, which we name 'NAI', and in Section 5 we then propose improved communication patterns for this protocol to make it more scalable. The steps of our protocol are listed below.

• Step 1: The parties agree upon the following parameter values: the BF length $l$; the $k$ hashing functions $h_1, \ldots, h_k$ to be used; the length (in characters) of grams $q$; a minimum similarity threshold value $s_t$ ($0 \le s_t \le 1$), above which a set of records is classified as a match; a private blocking function block(·); the blocking keys [2] $B$ used for blocking; and a set of QID attributes $A$ used for the linkage.

• Step 2: Each party $P_i$ ($1 \le i \le p$) individually applies a private blocking function [43] block(·) to reduce the number of candidate sets of records $C$ (from $\prod_i n_i$, where $n_i = |D_i|$ is the number of records in $D_i$ held by party $P_i$), which otherwise becomes prohibitive even for moderate $n_i$ and $p$. The block(·) function groups records according to their blocking key values (BKVs) [3] and only records with the same BKV (i.e. records in the same block) from different parties are then compared and classified using our protocol. A phonetic-based blocking [2] or any of the existing multi-party blocking techniques for PPRL [26, 27] can be used as the block(·) function.
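The building blocks of Sections 2.2 to 2.4 (basic secure summation, CBF construction by vector addition, and the Dice coefficient of Eq. (2)) can be sketched together in Python. This is a minimal single-process illustration, not the authors' implementation: the ring is simulated by a loop, the range of the random mask values is an assumption, and all function names are hypothetical. The two bit vectors are the ones shown in Figure 2.

```python
import random
from typing import List

Vector = List[int]


def add(u: Vector, v: Vector) -> Vector:
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


def secure_sum_cbf(bfs: List[Vector], rng: random.Random) -> Vector:
    """Simulate ring-based basic secure summation (BSS) of p Bloom filters.

    The initiator draws a random vector r', every party adds its own BF to
    the running vector, and the initiator subtracts r' at the end.
    """
    length = len(bfs[0])
    # Range of the mask values is an assumption made for this sketch.
    r_prime = [rng.randrange(1, 1000) for _ in range(length)]
    running = r_prime
    for bf in bfs:  # P_1, P_2, ..., P_p in ring order
        running = add(running, bf)
    return [s - r for s, r in zip(running, r_prime)]


def dice_from_cbf(cbf: Vector, p: int) -> float:
    """Eq. (2): p * |{beta : c[beta] = p}| / sum over beta of c[beta]."""
    total = sum(cbf)
    if total == 0:
        return 0.0  # convention for all-zero BFs (assumption)
    z = sum(1 for count in cbf if count == p)
    return p * z / total


def dice_from_bfs(bfs: List[Vector]) -> float:
    """Eq. (1): p * z / sum over i of x_i, from the individual BFs."""
    p = len(bfs)
    z = sum(1 for bits in zip(*bfs) if all(bits))
    total = sum(sum(bf) for bf in bfs)
    return p * z / total if total else 0.0


# Bit vectors of 'peter' and 'pete' as shown in Figure 2 (l = 14).
B1 = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
B2 = [1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]

if __name__ == "__main__":
    cbf = secure_sum_cbf([B1, B2], random.Random(42))
    print(cbf)                                 # count vector of Figure 2(b)
    print(round(dice_from_cbf(cbf, p=2), 2))   # 0.83
```

The LU-side function `dice_from_cbf` only ever receives the count vector, which is the point of Theorem 2.1; `dice_from_bfs` is included solely to check that both equations agree.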
Algorithm 1: Secure summation of BF vectors.
Input:
- D^M_i: Party P_i's records masked into BFs b_i, 1 ≤ i ≤ p
- R′: List of random vectors used by party P_1 or the LU for secure summation of BFs
Output:
- C: Candidate record sets with CBFs
1: C = {}                                  // initialize C
2: for 1 ≤ i ≤ p do:
3:   if i = 1 then:                        // first party
4:     for rec ∈ D^M_1 do:
5:       c = R′[rec] + D^M_1[rec]          // summation
6:       C[rec] = c
7:     C.sendto(P_2)                       // send C to P_2
8:   else:                                 // other parties
9:     C.receivefrom(P_{i−1})              // receive from P_{i−1}
10:    for cs ∈ C do:
11:      for rec ∈ D^M_i do:
12:        c = C[cs] + D^M_i[rec]          // addition
13:        cs = cs.append(rec)
13:        C[cs] = c
14:    if i ≠ p then:
15:      C.sendto(P_{i+1})                 // send C to P_{i+1}
16:    else:
17:      C.sendto(P_1 or LU)               // send C to P_1/LU

Algorithm 2: Similarity calculation of record sets.
Input:
- R′: List of random vectors used by party P_1 or the LU for secure summation of BFs
- C: Candidate record sets with CBFs (including random vectors) from party P_p
- s_t: Minimum similarity threshold to classify record sets
Output:
- M: Matching record sets
1: M = {}                                  // initialize M
2: C.receivefrom(P_p)                      // receive CBFs
3: for cs ∈ C do:
4:   c = C[cs] − R′[cs]                    // subtract
5:   Dice_sim(cs) = p × |{β : c[β] = p, 1 ≤ β ≤ l}| / Σ_{β=1}^{l} c[β]   // Eq. (2)
6:   if Dice_sim(cs) ≥ s_t then:
7:     M.append([cs, Dice_sim(cs)])        // a match
8: for 1 ≤ i ≤ p do:
9:   M.sendto(P_i)

• Step 3: Each party $P_i$ hash-maps the q-gram values of QIDs $A$ of each of its $n_i$ records in their respective databases $D_i$ into $n_i$ BFs (one BF per record in $D_i$) of length $l$ using the hash functions $h_1, \ldots, h_k$. It is crucial to set the BF related parameters in an optimal way that balances all three properties of PPRL (complexity, quality, and privacy). We further discuss parameter setting for BFs used in our protocol in Section 6.

• Step 4: In the next step, the parties initiate a secure summation protocol to securely perform vector addition between their BFs in order to generate a CBF for each set of candidate records $C$. This secure summation can be initiated by an external linkage unit LU that provides a vector (of length $l$) of random values or by one of the parties (as will be discussed further in Section 5). This process is outlined in Algorithm 1.

In lines 3-6, the LU or the party that initiated the communication (we assume $P_1$) sends or uses, respectively, a random vector $R'[rec]$ for each record $rec \in D^M_1$ to sum with the party's BF vector $b_i \in D^M_i$ ($i = 1$) by applying a vector addition operation. The summed vectors $R'[rec] + b_i \in C$ are then sent to party $P_{i+1}$ in line 7. Party $P_i$, $2 \le i \le p$, receives the summed vectors from $P_{i-1}$ (line 9) and adds its BF vector $b_i \in D^M_i$ to each candidate set $cs \in C$ and sends the summed vectors to the next party $P_{i+1}$. This process is repeated until the last party (i.e. $P_p$) adds its $b_p$ vector to each received summed vector $R'[rec] + \sum_{i=1}^{p-1} b_i$ from party $P_{p-1}$ and sends the final summed vector back to the LU or $P_1$ for each candidate set $cs$, as shown in lines 8-17 in Algorithm 1.

• Step 5: Finally, either the LU or the first party, $P_1$, that has provided the random vectors $R'$, subtracts $R'[rec]$ from the final summed vector $R'[rec] + \sum_{i=1}^{p} b_i$ as received from the last party $P_p$ for each candidate set $cs \in C$. This is achieved by subtraction of vectors (i.e. $c = R'[rec] + \sum_{i=1}^{p} b_i - R'[rec]$), which is a special case of vector addition, as outlined in lines 2-4 in Algorithm 2. As shown in lines 5-7, the LU or $P_1$ then calculates the Dice coefficient similarity of each resulting CBF $c$ following Eq. (2) to classify the compared sets of records (BFs) within a block into matches and non-matches based on the similarity threshold $s_t$. The final similarities of matching sets of records will be sent to all parties $P_i$, with $1 \le i \le p$, in lines 8-9 in Algorithm 2.

The basic secure summation protocol (BSS) used in this 'NAI' protocol is vulnerable to collusion risk among the parties. Further, the number of candidate sets to be compared for multi-party linkage in this naïve method (NAI) is exponential in the number of parties and their dataset sizes, which is prohibitively large to be practically feasible (even with the existing private blocking and filtering approaches employed) [39]. Efficient communication patterns among the parties therefore need to be employed in order to make multi-party PPRL scalable and viable in real applications that require data of very large sizes from many parties to be integrated. In the following sections we propose extended secure summation protocols and two efficient communication patterns that not only improve the scalability of our multi-party PPRL protocol but also make it more secure (with less possible collusion among the parties).

4. EXTENDED SECURE SUMMATION
The basic secure summation protocol (BSS) described in Section 2.3 is susceptible to collusion risk by the parties involved, where if two or more parties collude they are able to infer the input of another party. In order to overcome this risk (to improve privacy), we propose two extended secure summation protocols.

• Homomorphic-based secure summation (HSS): The partially homomorphic Paillier cryptosystem [24] is a reliable secure multi-party computation (SMC) technique for performing secure joint computation among several parties. In the HSS approach a pair of private and public keys is used for encrypting and decrypting the individual BF values. Successive encryption of the same value using the same public key generates
[Figure 3: four rings (Ring 1 to Ring 4) with two parties each, $p_1$ to $p_8$; the Linkage Unit (LU) sends random vectors to each ring, compares the returned CBFs, and passes the matches of the preceding rings on to the next ring.]
Figure 3: Sequential (SEQ) communication pattern for comparison of CBFs from $p = 8$ parties using a LU, as described in Section 5.1. In this example, the minimum number of parties per ring is set to $r_m = 2$. Communication steps are shown as circled numbers.
different encrypted values with high probability, and decrypting the encrypted values using a private key returns the correct original value. The public key is known to all parties while the private key is known only to the LU. Each party $P_i$ receives summed vectors containing encrypted values to which $P_i$ adds its encrypted $b_i$ vector (using the public key) and sends the encrypted summed vectors to the next party. Without knowing the private key a party $P_i$ cannot decrypt the received vectors and therefore colluding with a party to learn another party's $b_j$ (with $i \ne j$) would be impossible.

• Salting-based (using random seed integers) secure summation (SSS): The HSS approach provides a secure solution compared to the BSS approach against collusion attacks at the cost of an excessive computation overhead, making it not scalable to linking multiple large databases. Therefore, we propose the SSS approach to provide security against collusion attacks in an efficient way. Salting has been used to defend against dictionary attacks on one-way hash functions where an additional string is concatenated with a value to be encrypted [32]. We adopt a similar concept in the SSS approach where the salting key is an additional random integer used by each party $P_i$ individually when performing the secure summation. The salting key generated and used by each party is sent only to the LU and therefore a party $P_i$'s BF values cannot be identified by means of collusion among the parties, as without knowing the salting key of $P_i$ its BF values cannot be learned. Since the salting keys are random integer values, performing secure summation of BFs with the salting keys does not add any additional computation and communication overhead, except the communication of salting keys from parties to the LU.

5. COMMUNICATION PATTERNS
In this section, we propose two variations of improved communication patterns for our protocol based on CBFs for multi-party scenarios with and without a linkage unit (LU). The main idea of these improved communication patterns is to exploit the facts that most candidate sets are true non-matches (due to the class imbalance problem of record linkage [3]), and that a true matching set must have a high similarity between any sub-set of records in that set. Hence for multi-party PPRL applications with many parties it is promising to determine partial matches for a sub-set of parties and consider additional parties only for these partial matches.

The parties are first arranged into rings of size $r$ (with $r \le p$) based on the value for the parameter $r_m$, the minimum number of parties per ring ($r \ge r_m$). The value of $r_m$ needs to be carefully chosen, as it has a trade-off between scalability (complexity) and privacy. The higher the value for $r_m$ is, the better the privacy of the protocol because the resulting CBFs are more difficult to exploit with an inference attack (by mapping the CBFs to known values in a global database to infer their underlying unencoded values), as will be discussed in Section 6.2. On the other hand, higher values of $r_m$ result in lower scalability to large datasets across many parties because the number of comparisons required per ring increases exponentially with the ring size $r$.

5.1 Sequential communication
In this first proposed communication pattern (SEQ), which requires a LU, the matches found in one ring are compared with the candidate sets in the next ring, resulting in a set of matches from both rings which will then be compared with the candidate sets in the following ring, and so on. This communication is carried out sequentially until the matches from all rings are found by the LU. Figure 3 illustrates the SEQ approach for four rings with $r = 2$ parties in each ring.

Algorithm 3 details the steps of the SEQ communication pattern. First the parties are grouped in rings using the function group(·) (line 2 in Algorithm 3). Different numbers of parties ($\ge r_m$) can be grouped into different rings. The value for $r_m$ can be agreed upon by the parties to any value $r_m \ge 2$ at the trade-off between privacy and scalability, as will be discussed in Section 6.

To minimize the number of comparisons that are required, the grouping of parties into rings is ideally done in an ascending order of the size of their datasets. In this way, the first ring will generate a smaller number of matches, which then have to be compared with the candidate sets in the following rings. This reduces the computational complexity compared to an initially larger number of matches being generated by the first ring if the parties in the first ring are the ones with the largest databases.

A loop is iterated over rings in line 3 of Algorithm 3 and then the parties in rings are iterated in line 4. Each party in ring retrieves its records in line 5 and a loop is iterated over these records in line 7 to append each of them to every candidate record set (from previous party, except for the
first party that appends to empty sets) stored in C (line 8). Next, a secure summation protocol is applied in line 9 using sec_sum(·) function on the candidate sets of BFs C identified in ring in order to generate CBFs for each candidate set. In lines 10-12, each CBF c generated is then used to calculate the Dice similarity (Dice_sim) of the candidate set (cs) and if the Dice_sim(cs) $\geq s_t$ (line 13) then cs is added into the list of matches M identified in the ring (line 14), which will then be used as an input (line 15) in the next ring.

Figure 4: Ring by ring (RBR) communication pattern for comparison of CBFs from p = 9 parties without a LU, as described in Section 5.2. In this example, the minimum number of parties per ring is $r_m = 3$. Communication steps are shown as circled numbers. [Diagram not reproduced: Phase 1 shows Ring 1, Ring 2 and Ring 3 with three parties each ($p_1$ to $p_9$); Phase 2 combines the matches of rings 1, 2 and 3 into the overall matches.]

Algorithm 3: Comparison of CBFs using SEQ (by LU).
Input:
- $D^M_i$: Party $P_i$'s records with RIDs and BFs, $1 \leq i \leq p$
- $r_m$: Minimum ring size, with $r_m \geq 2$
- $s_t$: Minimum similarity threshold to classify record sets
Output:
- M: Matching record sets
 1: C = {[]}; M = {[]}                              // initialize C, M
 2: rings = group([P_1, P_2, ···, P_p], r_m)        // group parties
 3: for ring ∈ rings do:                            // iterate rings
 4:   for i ∈ ring do:                              // iterate parties
 5:     records = D^M_i.values()                    // party P_i's records
 6:     for rec_set ∈ C do:
 7:       for rec ∈ records do:
 8:         rec_set.append(rec)                     // C of the ring
 9:   C′ = sec_sum(C, ring)                         // summation
10:   for cs ∈ C′ do:
11:     c = C′[cs]
12:     Dice_sim(cs) = $\frac{p \times |\{\beta : c[\beta] = p, 1 \leq \beta \leq l\}|}{\sum_{\beta=1}^{l} c[\beta]}$   // Eq. (2)
13:     if Dice_sim(cs) ≥ s_t then:                 // a match
14:       M.append(cs)
15:   C = M

Algorithm 4: Comparison of CBFs using RBR (without LU).
Input:
- $D^M_i$: Party $P_i$'s records with RIDs and BFs, $1 \leq i \leq p$
- $r_m$: Minimum ring size, with $r_m \geq 3$
- $s_t$: Minimum similarity threshold to classify record sets
Output:
- M: Matching record sets
 1: rings = group([P_1, P_2, ···, P_p], r_m)        // group parties
 2: for ring ∈ rings do:                            // phase 1
 3:   C_ring = {[]}; M_ring = {[]}; r = len(ring)
 4:   for i ∈ ring do:                              // iterate parties
 5:     records = D^M_i.values()
 6:     for rec_set ∈ C_ring do:
 7:       for rec ∈ records do:
 8:         rec_set.append(rec)                     // C in the ring
 9:   C′_ring = sec_sum(C_ring, ring)               // summation
10:   for cs ∈ C′_ring do:
11:     c = C′_ring[cs]
12:     Dice_sim(cs) = $\frac{r \times |\{\beta : c[\beta] = r, 1 \leq \beta \leq l\}|}{\sum_{\beta=1}^{l} c[\beta]}$   // Eq. (2)
13:     if Dice_sim(cs) ≥ s_t then:                 // a match
14:       M_ring.append(cs)                         // M in the ring
15: C = {[]}; M = {[]}                              // initialize
16: for ring ∈ rings do:                            // phase 2
17:   matches = M_ring.values()
18:   for match_set ∈ C do:
19:     for match ∈ matches do:
20:       match_set.append(match)                   // C in all rings
21: C′ = sec_sum(C, [P_1, P_2, ···, P_p])           // summation
22: for cs ∈ C′ do:
23:   c = C′[cs]
24:   Dice_sim(cs) = $\frac{p \times |\{\beta : c[\beta] = p, 1 \leq \beta \leq l\}|}{\sum_{\beta=1}^{l} c[\beta]}$   // Eq. (2)
25:   if Dice_sim(cs) ≥ s_t then:                   // a match
26:     M.append(cs)                                // M in all rings
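As a rough illustration of lines 9 to 14 of Algorithm 3 and of the ring-wise filtering in SEQ, the following Python sketch builds a CBF by position-wise summation and applies the Dice similarity of Eq. (2). It is a simplified sketch under stated assumptions, not the protocol itself: the plain `sum` stands in for the secure summation sec_sum(·), BFs are short bit lists without record identifiers, and the Dice similarity is evaluated with the number of BFs currently in a candidate set (an assumption made here so that the partial sets of earlier rings can be scored). The names `cbf`, `dice_sim` and `seq_link` are hypothetical helpers introduced only for this example.

```python
def cbf(bfs):
    """Counting Bloom filter: position-wise sum of equal-length bit lists.
    Assumption: a plain sum replaces the secure summation protocol."""
    return [sum(bits) for bits in zip(*bfs)]


def dice_sim(c, x):
    """Eq. (2) for a CBF c built from x BFs:
    x * |{beta : c[beta] = x}| / sum_beta c[beta]."""
    total = sum(c)
    if total == 0:
        return 0.0
    common = sum(1 for count in c if count == x)  # positions set in all x BFs
    return x * common / total


def seq_link(rings, s_t):
    """Sequential (SEQ) comparison: matches of one ring become the
    candidate sets that the next ring extends."""
    candidates = [[]]
    for ring in rings:
        for records in ring:  # records: the BFs held by one party
            candidates = [cs + [bf] for cs in candidates for bf in records]
        # Assumption: score each set with its current size, not the final p.
        candidates = [cs for cs in candidates
                      if dice_sim(cbf(cs), len(cs)) >= s_t]
    return candidates


# Toy data: four parties in two rings of two parties each.
party_a = [[1, 1, 0, 1], [0, 0, 1, 1]]
party_b = [[1, 1, 0, 1], [1, 0, 0, 0]]
party_c = [[1, 1, 0, 0]]
party_d = [[1, 1, 0, 1]]
rings = [[party_a, party_b], [party_c, party_d]]

if __name__ == "__main__":
    print(seq_link(rings, 0.7))  # one set of four BFs (Dice = 8/11)
    print(seq_link(rings, 0.8))  # []
```

In this toy input only one of the four candidate pairs survives the first ring, so the second ring extends a single set instead of four, which is the saving that SEQ aims for.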
The risk of collusion between parties in order to identify data about another party can be reduced in this approach by using different BF encodings in different iterations. For example, if the encoding used in ring 1 in Figure 3 is unknown to parties in ring 2, then the collusion between the LU and parties in ring 2 would not reveal sufficient information to infer the actual values masked in the BFs in ring 1. Hence, parties might wish to be grouped with certain other parties in the same ring for better privacy protection.

5.2 Ring by ring communication

In the absence of a trusted LU, as is required by the previously described SEQ communication approach, we propose a ring by ring communication pattern (RBR) for comparing CBFs from multiple parties without using a LU. This method is illustrated in Figure 4 for three rings with r = 3 parties in each ring, while Algorithm 4 outlines the steps of RBR in detail. Similar to SEQ, parties are grouped into rings using the group(·) function (line 1 in Algorithm 4). The value for $r_m$ in RBR should be set to $r_m \geq 3$, as a minimum of three parties are required in each ring to perform secure summation without a LU.

The RBR method consists of two phases. In the first phase (lines 2-14 in Algorithm 4), the parties in each ring (lines 2-4) individually perform secure summation among them using sec_sum(·) (line 9) on their sets of candidate records $C_{ring}$ (generated in lines 5-8) to generate the CBFs $C'_{ring}$, and calculate their similarities to identify matches $M_{ring}$ in each ring (as shown in lines 10-14). In the second
phase in lines 16-26, all parties then perform secure summation among them on the matches identified in each ring $M_{ring}$ in order to identify matches M from all p parties.

Every ring in the first phase can employ a different set of BF parameters to reduce the possibility of collusion between a set of parties in different rings. In the second phase, all parties then have to agree on another set of parameters for BF encodings of the matches identified in rings in the first phase. In addition, the rings in the first phase can be processed independently and in parallel in a distributed environment making the RBR more scalable (than the SEQ) with larger dataset sizes.

6. ANALYSIS OF THE PROTOCOL

In this section we analyze our multi-party PPRL protocol in terms of complexity, privacy, and linkage quality.

6.1 Complexity analysis

We assume p parties participate in the protocol, each having a database of n records. We assume a private blocking/indexing technique employed in the blocking step forms b ≤ n blocks for each party. In Step 1 of our protocol, the agreement of parameters has a constant communication complexity. Blocking the databases in Step 2 has O(n) computation complexity at each party, and finding the intersection of blocks from all parties has a communication complexity of O(p·b) and a computation complexity of O(b·log b) at each party, as p·b BKVs need to be securely communicated, and for each of the b BKVs a search operation of O(log b) is required in order to identify the intersection set. Assuming the average number of q-grams in the QID attributes A of each record is g, the masking of QID values of records into BFs of length l using k hash functions for n records in Step 3 is O(n·g·k) at each party.

Steps 4 and 5 consist of the secure summation of BF vectors to calculate the CBFs of candidate sets. Our extended secure summation protocols HSS and SSS aim to improve privacy at the cost of complexity overhead. The extended HSS protocol requires n·l·p encrypted values (long integers of 4 bytes each) to be exchanged among the parties, while the basic secure summation (BSS) and SSS require exchanging n·l·p short integer values (of 2 bytes each), which is more efficient compared to HSS. Further, the homomorphic encryption and decryption functions used in HSS are computationally expensive compared to simple vector addition and subtraction operations [19].

With the simplified assumption that all blocks are of equal size n/b, i.e. contain n/b BFs at each party, then using the NAI communication method in each block $(n/b)^p$ candidate sets of BFs (i.e. all candidate sets of records in a block) have to be generated and their CBFs calculated. The first party performs $b(n/b)$ summations, the second party $b(n/b)^2$, and finally $b(n/b)^p$ summations are performed by the last party, leading to a total of $O(\sum_{i=1}^{p} b(n/b)^i)$ summations in Step 4. In Step 5, either the linkage unit LU or the first party that initiated the secure summation protocol performs a vector subtraction operation (subtracts the random vectors from the summed vectors) on all the candidate sets, resulting in $b(n/b)^p$ CBFs. The similarities of these CBFs are then calculated and matches classified, which requires $O(b(n/b)^p)$ computations. This combinatorial complexity currently limits the NAI linkage to a small number of parties or a large number of small blocks (i.e. p or n/b has to be small).

The communication patterns SEQ and RBR proposed in Sections 5.1 and 5.2, respectively, improve Steps 4 and 5 significantly (depending on the number of parties per ring, r) by reducing the computation and communication complexities. In general the complexities are reduced from the exponential growth with p down to r+1 in SEQ and max(r, p/r) in RBR. With the simplified assumption that each ring has r parties, the computation and communication complexities of the SEQ and RBR methods are as follows.

In the SEQ method, $b(n/b)^r$ candidate sets are processed in the first ring and $b(n/b)^{r+1}$ in each of the remaining rings (i.e. $b(n/b)$ matching sets, in the worst case, from the previous ring are compared with the $b(n/b)^r$ sets in each ring), resulting in total computation and communication complexities of $O\left(\sum_{i=1}^{r} b(n/b)^i + \sum_{i=1}^{r+1} b(n/b)^i \cdot (p/r - 1)\right)$ (with r < p). The computation and communication complexities of the RBR method are $O\left(\sum_{i=1}^{r} b(n/b)^i \cdot (p/r) + \sum_{i=1}^{p/r} b(n/b)^i\right)$, where the first phase requires $b(n/b)^r$ total candidate sets to be processed in each of the p/r rings and the second phase compares the $b(n/b)$ matching sets from each ring.

Overall, the complexity of the NAI method is $O(b(n/b)^p)$, the SEQ method is $O(b(n/b)^{r+1} \cdot p/r)$, and the RBR method is $O(\max(b(n/b)^r \cdot p/r,\; b(n/b)^{p/r}))$. This theoretical analysis shows that the two proposed communication methods SEQ and RBR are computationally efficient compared to the NAI method. Depending on the values for p and r, the SEQ and RBR methods outperform each other.

The memory size required for a CBF is x bits for every position in the CBF, where $2^x \geq p$. If the length of the CBF is l, the total memory consumption is $l \times \lceil \log_2(p) \rceil$ bits. For p BFs the memory required is 1 bit for every position in each BF, and therefore the total memory consumption is $1 \times l \times p$ bits. A CBF requires relatively more memory than using BFs when p is small, however with an increasing number of parties p in multi-party PPRL, a CBF requires significantly lower memory compared to BFs. For example, when p = 5 and l = 1,000 the respective memory sizes of a CBF and the corresponding BFs are $1000 \times \lceil \log_2(5) \rceil = 3000$ bits and $1 \times 1000 \times 5 = 5000$ bits, while for p = 10 and l = 1,000 they are $1000 \times \lceil \log_2(10) \rceil = 4000$ bits and $1 \times 1000 \times 10 = 10000$ bits, respectively.

6.2 Privacy analysis

As with most of the existing PPRL approaches, we assume that all parties follow the honest-but-curious (semi-honest) adversary model [43], where the parties follow the protocol while being curious to learn the other parties' data by means of inference attacks on input data [41] or collusion [43]. To analyze the privacy against inference attacks, we discuss what the parties can learn through inference attacks or collusion during the protocol. We assume a trusted LU is available, which does not collude with any parties, as is commonly used in many practical PPRL applications [29].

However, collusion among the database owners is a privacy risk in the basic secure summation protocol where a set of parties can collude to learn the BF of another party using their received summation values. To overcome this problem, in Section 4 we have proposed two extended secure summation protocols, HSS and SSS. The HSS protocol uses homomorphic encryptions for secure summation which makes the protocol more secure because without knowing the private key (which is only known to the LU) identifying
a party's BF values by means of collusion will not be successful. However, this protocol encrypts each integer value in a BF (in total l values for each BF) into a hash key (long integers) and thus incurs a very large communication overhead making the protocol not viable for linkage of multiple large databases. The SSS protocol similarly makes the protocol more secure by adding additional integer values as salting keys by each party individually (known only to the LU) in the secure summation, such that without knowing the salting key value collusion among parties to learn a party's BF is not possible. Compared to HSS, the SSS approach does not incur any expensive communication overhead as the salting keys are small integer values.

Communication occurs among the parties in Step 1 of our protocol (as described in Section 3) where they agree on parameter settings, and in Steps 4 and 5 where they participate in a secure summation protocol. The agreement of parameter settings in Step 1 does not reveal any sensitive information about the underlying data. Secure summation involves partial (masked) data exchange where the parties communicate the partial and full summations of their BFs (1) among them in Step 4 and (2) to the LU or the first party that initiated the communication in Step 5, respectively, to calculate the CBFs of the candidate sets and their similarities.

The LU (in SEQ or NAI) or the first party in each ring (in RBR) receives the CBFs of candidate sets in each ring. Compared to calculating similarities of sets using their BFs directly, using a CBF makes the inference attack on individual BFs and thus their q-grams (strings) mapped into them more difficult. An inference attack allows an adversary to map a list of known values from a global dataset (e.g. q-grams or attribute values from a public telephone directory) to the encoded values (BFs or CBF) using background information (such as frequency) [41, 16]. The only information that can be learned from such an inference attack using a CBF c of a set of x BFs (summed over x parties, where either x = p in NAI and RBR methods or x = r in SEQ and RBR methods) is if a bit position in c is either 0 or x which means it is set to 0 or 1, respectively, in the BFs from all x parties.

Proposition 6.1. The probability of identifying the original (unencoded) values of x (x > 1) individual records $R_i$ (with $1 \leq i \leq x$) given a single CBF c is smaller than the probability of identifying the original (unencoded) values of $R_i$ given x individual BFs $b_i$, $1 \leq i \leq x$.

$\forall_{i=1}^{x} \; Pr(R_i \mid c) < Pr(R_i \mid b_i)$    (3)

Proof. Assume the number of original (unencoded) values that can be mapped to a masked BF pattern $b_i$ from an inference attack is $n_g$. $n_g = 1$ in the worst case, where a one-to-one mapping exists between the masked BF $b_i$ and the original unencoded value of $R_i$. The probability of identifying the original value given a BF in the worst case scenario is therefore $Pr(R_i \mid b_i) = 1/n_g = 1.0$. However, a CBF represents x BFs and thus at least (in the worst case) x original (unencoded) values, which leads to a maximum of $Pr(R_i \mid c) = 1/x$ with x > 1 (when x = 1, $c \equiv b_1$). Hence, $\forall_{i=1}^{x} \; Pr(R_i \mid c) < Pr(R_i \mid b_i)$.

We will empirically evaluate and compare the amount of privacy provided by masking records into CBFs and BFs against an inference attack in Section 7.5. A larger number of parties in a ring r (i.e. the larger the value for x) results in an increase in the difficulty of an inference attack (a smaller probability of suspicion 1/x) by the adversary (LU or the first party in each ring for SEQ and RBR, respectively) at the cost of more candidate set comparisons.

Further, using different hash encodings $h_1, \cdots, h_k$ by different rings in our SEQ and RBR methods improves privacy compared to the NAI method by reducing the possibilities of collusions, as discussed in Section 5. Since the hash functions used by parties in ring 1, for example, are not known to parties in other rings, a collusion between parties in other rings and/or the LU will not be successful in inferring the original values of parties in ring 1. A careful grouping of parties is therefore required to improve privacy in a multi-party setting (for example, LU randomly groups or changes the grouping for different blocks). More specifically, when p parties are involved in the linkage, the maximum number of possible combinations for collusion in the NAI method is $\sum_{i=1}^{p} (p-1)$, while in SEQ and RBR it is $\sum_{i=1}^{p/r} \sum_{j=1}^{r} (r-1)$. For example, with p = 9 parties the NAI method has 72 possibilities to collude while grouping into equal sized rings (r = 3) leads to only 18 different collusion possibilities.

The values for the number of hash functions used (k) and the length of the BF (l) provide a trade-off between the linkage quality and privacy [34]. The higher the value for k/l, the higher the privacy and the lower the quality of linkage, because the number of q-grams mapped to a single bit (and therefore the number of resulting collisions) increases, which leads to lower linkage quality but makes it more difficult for an adversary to learn the possible q-gram combinations [16]. The CLK encoding method (as discussed in Section 2.1) of hash-mapping several QID values from each record into one compound BF [34, 38] makes it even more difficult for an adversary to learn individual QID values that correspond to a revealed bit pattern in a BF.

6.3 Linkage quality analysis

Our protocol supports approximate matching of QID values, in that data errors and variations are taken into account depending on the minimum similarity threshold $s_t$ used. The quality of BF-based masking depends on the BF parameterization [34, 37]. For a given BF length, l, and the number of elements g (e.g. q-grams) to be inserted into the BF, the optimal number of hash functions, k, that minimizes the false positive rate f (of a collision of two different q-grams being mapped to the same bit position), is calculated as [21] $k = (l/g) \ln(2)$, leading to a false positive rate of $f = (1/2^{\ln(2)})^{l/g}$.

While k and l determine the computational aspects of BF masking, linkage quality and privacy will be determined by the false positive rate f. A higher value for f will mean a larger number of false matches and thus lower linkage quality [21, 34]. In our experimental evaluation we will set the BF parameters for our approach according to the discussion presented here and following earlier BF work in PPRL [9, 34, 38, 37].

7. EXPERIMENTAL EVALUATION

In this section, we empirically evaluate the performance of our protocol (which we refer to as AM-CBF for Approximate Matching with Counting BFs) with the SEQ, RBR, and NAI communication patterns in terms of the three properties of
PPRL, scalability (complexity), linkage quality, and privacy. We describe the competing baseline methods in Section 7.1, the datasets used in Section 7.2, the evaluation measures in Section 7.3, and the experimental settings in Section 7.4. We then discuss the experimental results in Section 7.5.

7.1 Baseline methods

We use Lai et al. [17]'s exact matching BF-based PPRL approach (referred to as EM-BF for Exact Matching with BFs) and Vatsalan and Christen [39, 40]'s approximate matching BF-based PPRL approach (AM-BF for Approximate Matching with BFs) as competing baseline methods to compare with our proposed approach. Since other existing approaches for multi-party PPRL (as reviewed in Section 8) are either based on expensive cryptographic techniques or applicable to categorical data only, we do not compare them with our approach.

Lai et al.'s EM-BF approach [17] performs exact matching of QIDs across multiple parties using BFs. In their approach, the QID values of all records in a dataset are first converted into one BF. Each party then partitions its BF into segments according to the number of parties involved in the linkage, and sends these segments to the corresponding other parties. The segments received by a party are combined using a conjunction (logical AND) operation. The resulting conjuncted BF segments are then exchanged between the parties to construct the full conjuncted BF. Each party checks its own full BF of each record with the conjuncted BF, and if the membership test is successful then the record is considered to be a match. Though the cost of this approach is low since the computation is completely distributed between the parties and the processing of BFs is fast, the approach can only perform exact matching.

Vatsalan and Christen proposed AM-BF [39, 40] by adapting the idea used in EM-BF of distributively computing the conjunction of a set of BFs from multiple parties to perform privacy-preserving approximate matching for multi-party PPRL. Once the conjuncted BF segments are computed by the respective parties, a secure summation protocol is initiated among the parties to securely sum the number of common 1-bits in the conjuncted BF segments as well as the total number of 1-bits in each party's BF. These two sums are then used to calculate the Dice coefficient similarity of the set of BFs. A filtering approach is employed to reduce the number of comparisons based on segment similarity, such that if a sub-set of BF segments of a candidate set (as calculated by a respective party) has lower similarity then the BFs do not have to be compared with any of the BFs from the other parties.

Both EM-BF and AM-BF approaches use the NAI method for comparing and classifying candidate sets of records.

7.2 Datasets

To provide a realistic evaluation of our approach, we based all our experiments on a large real-world database, the North Carolina Voter Registration (NCVR) database as available from ftp://alt.ncsbe.gov/data/. This database has been used for the evaluation of various other PPRL approaches [9, 26, 27, 39, 40, 41]. We have downloaded this database every second month since October 2011 and built a combined temporal dataset that contains over 8 million records of voters' names and addresses [4]. We are not aware of any available real-world dataset that contains records from more than two parties that would allow us to evaluate multi-party PPRL approaches. We therefore generated, based on the real NCVR database, a series of sub-sets for multiple parties, as will be described next.

To allow the evaluation of our approach with different numbers of parties, with different dataset sizes, and with data of different quality, we used and modified a recently proposed data corruptor [5, 36] to generate various datasets with different characteristics based on randomly selected records from the NCVR database. The identifiers of the selected and modified records were kept the same, which allows us to identify true and false matches and therefore evaluate linkage quality, as described below. Specifically, we extracted sub-sets of 5,000, 10,000, 50,000, 100,000, 500,000, and 1,000,000 records from the NCVR to generate datasets for 3, 5, 7, and 10 parties, where the number of matching records is set to 50% (i.e. half of the selected records occur in the datasets of all parties).

To evaluate how the approaches work with 'dirty' data, we created several series of datasets for each of the datasets generated above, where we included a varying number of corrupted records into the sets of overlapping records (0%, 20%, and 40%). We applied various corruption functions on randomly selected attribute values, including character edit operations (insertions, deletions, substitutions, and transpositions), and optical character recognition and phonetic modifications based on look-up tables and corruption rules [5]. This means that a certain percentage of records in the overlap were modified for randomly selected parties, while the original values were kept for the other parties. Therefore, some of these records are exact duplicates across some parties in a set, but are only approximately matching duplicates across the other parties in the set. This simulates, for example, the situation where three out of five hospitals have the correct and complete contact details (like name and address) of a certain patient, while in the fourth and fifth hospitals some of the details of the same patient are different.

7.3 Evaluation measures

We evaluate the three properties of PPRL using the following evaluation measures. The complexity (scalability) of linkage is measured by runtime, communication size, and the number of comparisons required for the linkage. The quality of the achieved linkage is measured using the F-measure, calculated on classified matches and non-matches, that has widely been used in record linkage, information retrieval and data mining [2].

In line with other work in PPRL [26, 39, 40, 41], we evaluate privacy using disclosure risk (DR) measures based on the probability of suspicion, i.e. the likelihood that a masked (encoded) database record in $D^M$ can be matched with one or several (masked) record(s) in a publicly available global database G. The probability of suspicion for a masked value/record $R^M$, $Pr(R^M)$, is calculated as $1/n_g$ where $n_g$ is the number of possible matches in $G^M$ to the masked value $R^M$. We conducted a frequency linkage attack [41] by mapping the exchanged bit information in the BFs generated from $D^M$ to the BFs generated from $G^M$. We used the worst case scenario where G ≡ D, because when G ≡ D there will be a one-to-one exact matching of a global value for each value in D. Based on such a linkage attack, we calculate the following disclosure risk measures, as proposed by Vatsalan et al. [41].