Concurrent Hash Tables: Fast and General(?)!
Tobias Maier1, Peter Sanders1, and Roman Dementiev2
1 Karlsruhe Institute of Technology, Karlsruhe, Germany {t.maier,sanders}@kit.edu
2 Intel Deutschland GmbH [email protected]
Abstract

Concurrent hash tables are one of the most important concurrent data structures, used in numerous applications. Since hash table accesses can dominate the execution time of whole applications, we need implementations that achieve good speedup even in these cases. Unfortunately, currently available concurrent hashing libraries turn out to be far away from this requirement, in particular when adaptively sized tables are necessary or contention on some elements occurs.

Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.

We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.

Category: [D.1.3] Programming Techniques – Concurrent Programming; [E.1] Data Structures – Tables; [E.2] Data Storage Representations – Hash-table representations

Terms: Performance, Experimentation, Measurement, Design, Algorithms

Keywords: Concurrency, dynamic data structures, experimental analysis, hash table, lock-freedom, transactional memory

1 Introduction

A hash table is a dynamic data structure which stores a set of elements that are accessible by their key. It supports insertion, deletion, find, and update in constant expected time. In a concurrent hash table, multiple threads have access to the same table. This allows threads to share information in a flexible and efficient way. Therefore, concurrent hash tables are one of the most important concurrent data structures. See Section 4 for a more detailed discussion of concurrent hash table functionality.

To show the ubiquity of hash tables we give a short list of example applications: A very simple use case is storing sparse sets of precomputed solutions (e.g., [27], [3]). A more complicated one is aggregation, as it is frequently used in analytical database queries of the form SELECT FROM ... COUNT ... GROUP BY x [25]. Such a query selects rows from one or several relations and counts for every key x how many rows have been found (similar queries work with SUM, MIN, or MAX). Hashing can also be used for a database join [5]. Another group of examples is the exploration of a large combinatorial search space, where a hash table is used to remember the already explored elements (e.g., in dynamic programming [36], itemset mining [28], a chess program, or when exploring an implicitly defined graph in model checking [37]). Similarly, a hash table can maintain a set of cached objects to save I/Os [26]. Further examples are duplicate removal, storing the edge set of a sparse graph in order to support edge queries [23], maintaining the set of nonempty cells in a grid data structure used in geometry processing (e.g., [7]), or maintaining the children in tree data structures such as van Emde Boas search trees [6] or suffix trees [21].

Many of these applications have in common that – even in the sequential version of the program – hash table accesses constitute a significant fraction
of the running time. Thus, it is essential to have highly scalable concurrent hash tables that actually deliver significant speedups in order to parallelize these applications. Unfortunately, currently available general purpose concurrent hash tables do not offer the needed scalability (see Section 8 for concrete numbers). On the other hand, it seems to be folklore that a lock-free linear probing hash table – where keys and values are machine words, which is preallocated to a bounded size, and which supports no true deletion operation – can be implemented using atomic compare-and-swap (CAS) instructions [36]. Find-operations can even proceed naively and without any write operations. In Section 4 we explain our own implementation (folklore) in detail, after elaborating on some related work and introducing the necessary notation (in Sections 2 and 3, respectively). To see the potential big performance differences, consider an exemplary situation with mostly read-only access to the table and heavy contention for a small number of elements that are accessed again and again by all threads. folklore actually profits from this situation because the contended elements are likely to be replicated into local caches. On the other hand, any implementation that needs locks or CAS instructions for find-operations will become much slower than the sequential code on current machines. The purpose of our paper is to document and explain performance differences and, more importantly, to explore to what extent we can make folklore more general with an acceptable deterioration in performance.

These generalizations are discussed in Section 5. We explain how to grow (and shrink) such a table, and how to support deletions and more general data types. In Section 6 we explain how hardware transactional memory can be used to speed up insertions and updates and how it may help to handle more general data types.

After describing implementation details in Section 7, Section 8 experimentally compares our hash tables with six of the most widely used concurrent hash tables for microbenchmarks including insertion, finding, and aggregating data. We look at both uniformly distributed and skewed input distributions. Section 9 summarizes the results and discusses possible lines of future research.

2 Related Work

This publication follows up on our previous findings about generalizing fast concurrent hash tables [18]. In addition to describing how to generalize a fast linear probing hash table, we offer an extensive experimental analysis comparing many concurrent hash tables from several libraries.

There has been extensive previous work on concurrent hashing. The widely used textbook "The Art of Multiprocessor Programming" [12] by Herlihy and Shavit devotes an entire chapter to concurrent hashing and gives an overview of previous work. However, it seems to us that a lot of previous work focuses more on concepts and correctness but surprisingly little on scalability. For example, most of the discussed growing mechanisms assume that the size of the hash table is known exactly, without a discussion that this introduces a performance bottleneck limiting the speedup to a constant. Similarly, the actual migration is often done sequentially.

Stivala et al. [36] describe a bounded concurrent linear probing hash table specialized for dynamic programming that only supports insert and find. Their insert operation starts from scratch when the CAS fails, which seems suboptimal in the presence of contention. An interesting point is that they need only word-size CAS instructions, at the price of reserving a special empty value. This technique could also be adapted to port our code to machines without 128-bit CAS.

Kim and Kim [14] compare this table with a cache-optimized lockless implementation of hashing with chaining and with hopscotch hashing [13]. The experiments use only uniformly distributed keys, i.e., there is little contention. Both linear probing and hashing with chaining perform well in that case. The evaluation of find-performance is a bit inconclusive: chaining wins, but uses more space than linear probing. Moreover, it is not specified whether this is for successful (use keys of inserted elements) or mostly unsuccessful (generate fresh keys) accesses. We suspect that varying these parameters could reverse the result.

Gao et al. [10] present a theoretical dynamic linear probing hash table that is lock-free. The main contribution is a formal correctness proof. Not all details of the algorithm, or even an implementation, are given. There is also no analysis of the complexity of the growing procedure.

Shun and Blelloch [34] propose phase-concurrent hash tables, which are allowed to use only a single operation within a globally synchronized phase. They show how phase concurrency helps to implement some operations more efficiently and even deterministically in a linear probing context. For example, deletions can adapt the approach from [15] and rearrange elements. This is not possible in a general hash table, since it might cause find-operations to report false negatives. They also outline an elegant growing mechanism, albeit without implementing it and without filling in all the details, like how to initialize newly allocated tables. They propose to trigger a growing operation when any operation has to scan more than k log n elements, where k is a tuning parameter. This approach is tempting since it is somewhat faster than the approximate size estimator we use. We actually tried that, but found that this trigger has a very high variance – sometimes it triggers late, making operations rather slow; sometimes it triggers early, wasting a lot of space. We also have theoretical concerns, since the bound k log n on the length of the longest probe sequence implies strong assumptions on certain properties of the hash function. Shun and Blelloch make extensive experiments, including applications from the problem based benchmark suite [35].

Li et al. [17] use the bucket cuckoo-hashing method by Dietzfelbinger and Weidling [8] and develop a concurrent implementation. They exploit the fact that, using a BFS-based insertion algorithm, the number of element moves for an insertion is very small. They use fine-grained locks, which can sometimes be avoided using transactional memory (Intel TSX). As a result of their work, they implemented the small open source library libcuckoo, which we measure against (and which does not use TSX). This approach has the potential to achieve very good space efficiency. However, our measurements indicate that the performance penalty is high.

The practical importance of concurrent hash tables also leads to new and innovative implementations outside of the scientific community. A good example of this is the Junction library, which was published by Preshing [31] in the beginning of 2016, shortly after our initial publication [19].
3 Preliminaries

We assume that each application thread has its own designated hardware thread or processing core and denote the number of these threads with p. A data structure is non-blocking if no blocked thread currently accessing this data structure can block an operation on the data structure by another thread. A data structure is lock-free if it is non-blocking and guarantees global progress, i.e., there must always be at least one thread finishing its operation in a finite number of steps.

Hash Tables store a set of ⟨Key, Value⟩ pairs (elements).1 A hash function h maps each key to a cell of a table (an array). The number of elements in the hash table is denoted n and the number of operations is m. For the purpose of algorithm analysis, we assume that n and m are ≫ p2 – this allows us to simplify algorithm complexities by hiding O(p) terms that are independent of n and m in the overall cost. Sequential hash tables support the insertion of elements, and finding, updating, or deleting an element with a given key – all of this in constant expected time. Further operations compute n (size), build a table with a given number of initial elements, and iterate over all elements (forall).

Linear Probing is one of the most popular sequential hash table schemes used in practice. An element ⟨x,a⟩ is stored at the first free table entry following position h(x) (wrapping around when the end of the table is reached). Linear probing is at the same time simple and efficient – if the table is not too full, a single cache line access will be enough most of the time. Deletion can be implemented by rearranging the elements locally [15] to avoid holes violating the invariant mentioned above. When the table becomes too full or too empty, the elements can be migrated to a larger or smaller table, respectively. The migration cost can be charged to insertions and deletions, causing amortized constant overhead.

1 Much of what is said here can be generalized to the case when elements are black boxes from which keys are extracted by an accessor function.
4 Concurrent Hash Table Interface and Folklore Implementation

Although it seems quite clear what a hash table is and how this generalizes to concurrent hash tables, there is a surprising number of details to consider. Therefore, we will quickly go over some of our interface decisions and detail how this interface can be implemented in a simple, fast, lock-free concurrent linear probing hash table.

This hash table will have a bounded capacity c that has to be specified when the table is constructed. It is the basis for all other hash table variants presented in this publication. We call this table the folklore solution, because variations of it are used in many publications and it is not clear to us by whom it was first published.

The most important requirement for concurrent data structures is that they should be linearizable, i.e., it must be possible to order the hash table operations in some sequence – without reordering two operations of the same thread – so that executing them sequentially in that order yields the same results as the concurrent processing. For a hash table data structure, this basically means that all operations should be executed atomically some time between their invocation and their return. For example, it has to be avoided that a find returns an inconsistent state, e.g., a half-updated data field that was never actually stored at the corresponding key.

Our variant of the folklore solution ensures the atomicity of operations using 2-word atomic CAS operations for all changes of the table. As long as the key and the value each only use one machine word, we can use 2-word CAS operations to atomically manipulate a stored key together with the corresponding value. There are other variants that avoid the need for 2-word compare-and-swap operations, but they often need a designated empty value (see [31]). Since the corresponding machine instructions are widely available on modern hardware, using them should not be a problem. If the target architecture does not support the needed instructions, the implementation can easily be switched to a variant of the folklore solution which does not use 2-word CAS. As it can easily be deduced from the context, we will usually omit the "2-word" prefix and use the abbreviation CAS for both single and double word CAS operations.

Initialization The constructor allocates an array of size c consisting of 128-bit aligned cells whose keys are initialized to the empty value.
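To make this concrete, here is a minimal C++ sketch of such a cell and its 2-word CAS, assuming x86-64 and the GCC/Clang __sync builtin on __int128 (which compiles to a cmpxchg16b instruction when built with -mcx16); the names Cell, cas128, and empty_key are our illustrative assumptions, not necessarily the authors' code.

    #include <cstdint>
    #include <cstring>

    // A 128-bit aligned cell holding a word-sized key and value.
    struct alignas(16) Cell {
        uint64_t key;   // empty_key marks a never-used cell
        uint64_t data;
    };

    static const uint64_t empty_key = 0;  // reserved key value (see Section 5.6)

    // 2-word CAS: replaces *cell with `desired` iff it still equals `expected`.
    inline bool cas128(Cell* cell, Cell expected, Cell desired) {
        __int128 exp, des;
        std::memcpy(&exp, &expected, sizeof exp);  // type-pun without UB
        std::memcpy(&des, &desired, sizeof des);
        return __sync_bool_compare_and_swap(
            reinterpret_cast<volatile __int128*>(cell), exp, des);
    }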
Modifications We propose to categorize all changes to the hash table content into one of the following three functions, which can be implemented very similarly (this does not cover deletions):

insert(k,d): Returns false if an element with the specified key is already present. Only one operation should succeed if multiple threads are inserting the same key at the same time.

update(k,d,up): Returns false if there is no value stored at the specified key; otherwise this function atomically updates the stored value to new = up(current, d). Notice that the resulting value can depend on both the current value and the input parameter d.

insertOrUpdate(k,d,up): This operation updates the current value if one is present; otherwise the given data element is inserted as the new value. The function returns true if insertOrUpdate performed an insert (the key was not present), and false if an update was executed.

We chose this interface for two main reasons. It allows applications to quickly differentiate between inserting and changing an element – this is especially useful since the thread who first inserted a key can be identified uniquely. Additionally, it allows transparent, lockless updates that can be more complex than just replacing the current value (think of CAS or fetch-and-add).

The update interface using an update function deserves some special attention, as it is a novel approach compared to most interfaces we encountered during our research. Most implementations fall into one of two categories: They return mutable references to table elements – forcing the user to implement atomic operations on the data type; or they offer an update function which usually replaces the current value with a new one – making it very hard to implement atomic changes like a simple counter (find + increment + overwrite is not necessarily atomic).
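As a usage sketch of the interface just described – with a hypothetical table object exposing insertOrUpdate as above – an atomic per-key counter needs no user-side synchronization:

    #include <cstdint>

    // Hypothetical usage: count how often `key` occurs. The update function
    // receives the key, the current value, and the parameter d (cf. the
    // signature in Algorithm 1) and returns the new value, which the table
    // applies atomically.
    template <typename Table>
    bool countOccurrence(Table& table, uint64_t key) {
        auto add = [](uint64_t /*key*/, uint64_t current, uint64_t d) {
            return current + d;
        };
        // true: first occurrence inserted <key,1>; false: counter incremented.
        return table.insertOrUpdate(key, 1, add);
    }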
In Algorithm 1 we show the pseudocode of the insertOrUpdate function.
ALGORITHM 1: Pseudocode for the insertOrUpdate operation
Input: Key k, Data Element d, Update Function up: Key × Val × Val → Val
Output: Boolean – true when a new key was inserted, false if an update occurred
1   i = h(k);
2   while true do
3       i = i % c;
4       current = table[i];
5       if current.key == empty_key then          // Key is not present yet ...
6           if table[i].CAS(current, ⟨k,d⟩) then
7               return true
8           else
9               i--;
10      else if current.key == k then             // Same key already present ...
11          if table[i].atomicUpdate(current, d, up) then
                // default: atomicUpdate(·) = CAS(current, ⟨k, up(k, current.data, d)⟩)
12              return false
13          else
14              i--;
15      i++;
The operation computes the hash value of the key and proceeds to look for an element with the appropriate key (beginning at the corresponding position). If no element matching the key is found (when an empty space is encountered), the new element has to be inserted. This is done using a CAS operation. A failed swap can only be caused by another insertion into the same cell. In this case, we have to revisit the same cell to check if the inserted element matches the current key. If a cell storing the same key is found, it will be updated using the atomicUpdate function. This function is usually implemented by evaluating the passed update function (up) and using a CAS operation to change the cell. In the case of multiple concurrent updates, at least one will be successful.

In our (C++) implementation, partial template specialization can be used to implement more efficient atomicUpdate variants using atomic operations – changing the default in line 11, e.g., overwrite (using a single word store) or increment (using fetch-and-add).
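For illustration, a minimal sketch of such a specialization (names ours), assuming an additive update function and that keys, once written, are never overwritten:

    #include <cstdint>

    // Sketch of a specialized atomicUpdate for additive updates, using the
    // Cell layout sketched above. Once line 10 has matched the key (and no
    // deletion can overwrite it), the value word can be updated on its own
    // with a single fetch-and-add instead of the generic CAS loop.
    inline void atomicUpdateAdd(Cell* cell, uint64_t d) {
        __atomic_fetch_add(&cell->data, d, __ATOMIC_RELAXED);
    }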
The code presented in Algorithm 1 can easily be modified to implement the insert (return false when the key is already present – line 10) and update (return true after a successful update – line 12, and false when the key is not found – line 5) functions. All modification functions have a constant expected running time.

Lookup Since this folklore implementation does not move elements within the table, it would be possible for find(k) to return a reference to the corresponding element. In our experience, returning references directly tempts inexperienced programmers to operate on these references in a way that is not necessarily threadsafe. Therefore, our implementation returns a copy of the corresponding cell (⟨k,d⟩) if one is found (⟨empty_key,·⟩ otherwise). The find operation has a constant expected running time.

Our implementation of find is somewhat non-trivial, because it is not possible to read two machine words at once using an atomic instruction.2 Therefore, it is possible for a cell to be changed in between reading its key and its value – this is called a torn read. We have to make sure that torn reads cannot lead to any wrong behavior. There are two kinds of interesting torn reads: First, an empty key is read while the searched key is inserted into the same cell; in this case the element is not found (consistent, since it has not been fully inserted). Second,
the element is updated between the key being read and the data being read; since the data is read second, only the newer data is read (consistent with a finished update).

2 The element is not read atomically because x86 does not support that. One could use a 2-word CAS to achieve the same effect, but this would have disastrous effects on performance when many threads try to find the same element.
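A sketch of such a find, building on the Cell sketch from Section 4 (hash() stands for the assumed hash function h); the comments mark why both torn-read cases stay consistent:

    #include <cstddef>
    #include <cstdint>
    #include <optional>

    uint64_t hash(uint64_t key);  // assumed hash function h

    // Sketch of find(k): key and data are read by two separate atomic loads,
    // so a concurrent writer may interleave between them (a torn read).
    // Assumes the table is never completely full (load factor < 1).
    std::optional<Cell> find(const Cell* table, size_t c, uint64_t k) {
        for (size_t i = hash(k) % c; ; i = (i + 1) % c) {
            uint64_t key = __atomic_load_n(&table[i].key, __ATOMIC_ACQUIRE);
            if (key == empty_key)
                return std::nullopt;  // insertion not finished -- consistently "not found"
            if (key == k) {
                // A racing update may overwrite data after our key load; we
                // then simply return the newer value -- consistent with
                // linearizing this find after the finished update.
                uint64_t data = __atomic_load_n(&table[i].data, __ATOMIC_ACQUIRE);
                return Cell{key, data};
            }
        }
    }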
Deletions The folklore solution can only handle deletions using dummy elements – called tombstones. Usually, the key stored in a cell is replaced with del_key. Afterwards, the cell cannot be used anymore. This method of handling deleted elements is usually not feasible, as it does not increase the capacity for new elements. In Section 5.4 we will show how our generalizations can be used to handle tombstones more efficiently.

Bulk Operations While not often used in practice, the folklore table can be modified to support operations like buildFrom(·) (see Section 5.5) – using a bulk insertion which can be more efficient than element-wise insertion – or forall(f) – which can be implemented in an embarrassingly parallel way by splitting the table between threads.

Size Keeping track of the number of contained elements deserves special notice here, because it turns out to be significantly harder in concurrent hash tables. In sequential hash tables, it is trivial to count the number of contained elements – using a single counter. The same method is possible in parallel tables using atomic fetch-and-add operations, but it introduces a massive amount of contention on one single counter, creating a performance bottleneck.

Because of this, we did not include a counting method in the folklore implementation. In Section 5.2 we show how this can be alleviated using an approximate count.
Keeping an exact count of the elements stored in
the hash table can often lead to contention on one
count variable. Therefore, we propose to support
5 Generalizations and Exten-
only an approximative size operation.
sions To keep an approximate count of all elements,
eachthreadmaintainsalocalcounterofitssuccess-
In this section, we detail how to adapt the concur- ful insertions (using the method desribed in Sec-
rent hash table implementation – described in the tion 5.1). Every Θ(p) such insertions this counter
previous section – to be universally applicable to is atomically added to a global insertion counter
all hash table workloads. Most of our efforts have I and then reset. Contention at I can be provably
gone into a scalable migration method that is used
3Significant slow down created by the cache coherency
to move all elements stored in one table into an-
protocol due to multiple threads repeatedly changing dis-
other table. It turns out that a fast migration can tinctvalueswithinthesamecacheline.
6
made small by randomizing the exact number of local insertions accepted before adding to the global counter, e.g., between 1 and p. I underestimates the size by at most O(p2). Since we assume the size to be ≫ p2, this still means a small relative error. By adding the maximal error, we also get an upper bound for the table size.

If deletions are also allowed, we maintain a global counter D in a similar way. S = I − D is then a good estimate of the total size as long as S ≫ p2.

When a table is migrated for growing or shrinking (see Section 5.3.1), each migration thread locally counts the elements it moves. At the end of the migration, the local counters are added up to create the initial count for I (D is set to 0).

This method can also be extended to give an exact count – in the absence of concurrent insertions/deletions. To do this, a list of all handles has to be stored at the global hash table object. A thread can then iterate over all handles, computing the actual element count.
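A minimal sketch of this counting scheme, assuming the counter lives inside the per-thread handle of Section 5.1 (all names ours):

    #include <atomic>
    #include <cstdint>
    #include <random>

    // Sketch: buffer successful insertions locally and flush them to the
    // shared counter I roughly every Theta(p) insertions. Randomizing the
    // flush threshold in 1..p is what provably keeps contention at I small.
    struct ApproxCounter {
        std::atomic<uint64_t>& I;  // global insertion counter
        uint64_t local = 0;        // this handle's unflushed insertions
        uint64_t threshold;        // randomized flush threshold in 1..p

        ApproxCounter(std::atomic<uint64_t>& global, unsigned p) : I(global) {
            std::mt19937 gen{std::random_device{}()};
            threshold = std::uniform_int_distribution<uint64_t>(1, p)(gen);
        }

        void countInsertion() {
            if (++local >= threshold) {
                I.fetch_add(local, std::memory_order_relaxed);
                local = 0;
            }
        }
    };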
5.3 Table Migration

While Gao et al. [10] have shown that lock-free dynamic linear probing hash tables are possible, there is no result on their practical feasibility. Our focus is geared more towards engineering the fastest migration possible; therefore, we are fine with small amounts of locking, as long as it improves the overall performance.

5.3.1 Eliminating Unnecessary Contention from the Migration

If the table size is not fixed, it makes sense to assume that the hash function h yields a large pseudorandom integer which is then mapped to a cell position in 0..c−1, where c is the current capacity.4 We will discuss a way to do this by scaling: if h yields values in the global range 0..U−1, we map key x to cell h_c(x) := ⌊h(x)·c/U⌋. Note that when both c and U are powers of two, the mapping can be implemented by a simple shift operation.

4 We use x..y as a shorthand for {x,...,y} in this paper.
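For example, with U = 2^64 (a full 64-bit hash value) and c = 2^k, the scaling ⌊h(x)·c/U⌋ reduces to keeping the k most significant bits (a sketch; the name is ours):

    #include <cstdint>

    // Sketch: map a 64-bit hash value h in 0..2^64-1 to a cell in 0..2^k-1.
    // This computes floor(h * 2^k / 2^64) as a single shift. Unlike masking
    // with c-1, it keeps the mapping monotone in h, which the cluster
    // argument below (Lemma 1) relies on.
    inline uint64_t scaleToCell(uint64_t h, unsigned k) {
        return h >> (64 - k);
    }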
Growing Now suppose that we want to migrate the table into a table that has at least the same size (growing factor γ ≥ 1, new capacity c′ = γc). Exploiting the properties of linear probing and our scaling function, there is a surprisingly simple way to migrate the elements from the old table to the new table in parallel which results in exactly the same order a sequential algorithm would produce and that completely avoids synchronization between threads.

Lemma 1. Consider a range a..b of nonempty cells in the old table with the property that the cells (a−1) mod c and (b+1) mod c are both empty – call such a range a cluster (see Figure 1a). When migrating a table, sequential migration will map the elements stored in that cluster into the range ⌊γa⌋..⌊γ(b+1)⌋ in the target table, regardless of the rest of the source array.

Proof. Let x be an element stored in the cluster a..b at position p(x) = h_c(x) + d(x) for some displacement d(x) ≥ 0. Then h_c(x) has to be in the cluster a..b, because linear probing does not displace elements over empty cells (h_c(x) = ⌊h(x)·c/U⌋ ≥ a), and therefore h(x)·c′/U ≥ a·c′/c ≥ γa. Similarly, from ⌊h(x)·c/U⌋ ≤ b follows h(x)·c/U < b+1, and therefore h(x)·c′/U < γ(b+1).

Therefore, two distinct clusters in the source table cannot overlap in the target table. We can exploit this lemma by assigning entire clusters to migrating threads, which can then process each cluster completely independently. Distributing clusters between threads can easily be achieved by first splitting the table into blocks (regardless of the table's contents) which we assign to threads for parallel migration. A thread assigned a block d..e will migrate those clusters that start within this range – implicitly moving the block borders to free cells, as seen in Figure 1b. Since the average cluster length is short and c = Ω(p2), it is sufficient to deal out blocks of size Ω(p) using a single shared global variable and atomic fetch-and-add operations. Additionally, each thread is responsible for initializing all cells in its region of the target table. This is important, because sequentially initializing the hash table can quickly become infeasible.

Note that waiting for the last thread at the end of the migration introduces some waiting (locking), but this does not create significant work imbalance, since the block/cluster migration is really fast and clusters are expected to be short.
Figure 1: Cluster migration and work distribution. (a) Two neighboring clusters and their non-overlapping target areas (γ = 2). (b) Left: table split into even blocks. Right: resulting cluster distribution (moved implicit block borders).
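A sketch of the per-block migration loop just described (names ours, reusing the Cell sketch from Section 4; h_cPrime(key) is assumed to compute the scaled position ⌊h(key)·c′/U⌋ in the target table). Plain stores suffice inside a cluster because, by Lemma 1, target ranges of distinct clusters cannot overlap:

    #include <cstddef>
    #include <cstdint>

    uint64_t h_cPrime(uint64_t key);  // assumed: scaled hash into the target table

    // Sketch: migrate all clusters that *start* in [blockStart, blockEnd).
    // A cluster beginning before blockStart belongs to the left neighbor,
    // which implicitly moves the block border to a free cell.
    void migrateBlock(const Cell* src, size_t c, Cell* dst, size_t cPrime,
                      size_t blockStart, size_t blockEnd) {
        size_t i = blockStart;
        while (i < blockEnd && src[i].key != empty_key) ++i;  // skip foreign cluster
        while (i < blockEnd) {
            while (i < blockEnd && src[i].key == empty_key) ++i;  // next cluster start
            if (i >= blockEnd) break;
            size_t j = i;
            while (src[j % c].key != empty_key) ++j;  // cluster end (may leave the block)
            for (size_t e = i; e < j; ++e) {          // re-insert cluster sequentially
                const Cell& elem = src[e % c];
                size_t pos = h_cPrime(elem.key);
                while (dst[pos].key != empty_key) pos = (pos + 1) % cPrime;
                dst[pos] = elem;  // plain store: no other thread touches this range
            }
            i = j;
        }
    }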
Shrinking Unfortunately, when shrinking, the nice structural Lemma 1 no longer applies. We can still parallelize the migration with little synchronization. Once more, we cut the source table into blocks that we assign to threads for migration. The scaling function maps each block a..b in the source table to a block a′..b′ in the target table. We have to be careful with rounding issues so that the blocks in the target table are non-overlapping. We can then proceed in two phases. First, a migrating thread migrates those elements that move from a..b to a′..b′. These migrations can be done in a sequential manner, since the target blocks are disjoint. The majority of elements will fit into the target block. Then, after a barrier synchronization, all elements that did not fit into their respective target blocks are migrated using concurrent insertion, i.e., using atomic operations. This has negligible overhead, since such elements only exist at the boundaries of blocks. The resulting allocation of elements in the target table will no longer be the same as for a sequential migration, but as long as the data structure invariants of a linear probing hash table are fulfilled, this is not a problem.
5.3.2 Hiding the Migration from the Underlying Application

To make the concurrent hash table more general and easy to use, we would like to avoid all explicit synchronization. The growing (and shrinking) operations should be performed asynchronously when needed, without involvement of the underlying application. The migration is triggered once the table is filled to a factor ≥ α (e.g., 50%); this is estimated using the approximate count from Section 5.2 and checked whenever the global count is updated. When a growing operation is triggered, the capacity will be increased by a factor of γ ≥ 1 (usually γ = 2). The difficulty is ensuring that this operation is done in a transparent way, without introducing any inconsistent behavior and without incurring undue overheads.

To hide the migration process from the user, two problems have to be solved. First, we have to find threads to grow the table, and second, we have to ensure that changing elements in the source table will not lead to any inconsistent states in the target table (possibly reverting changes made during the migration). Each of these problems can be solved in multiple ways. We implemented two strategies for each of them, resulting in four different variants of the hash table (mix and match).

Recruiting User-Threads A simple approach to dynamically allocating threads for growing the table is to "enslave" threads that try to perform table accesses that would otherwise have to wait for the completion of the growing process anyway. This works really well when the table is regularly accessed by all user-threads, but is inefficient in the worst case when most threads stop accessing the table at some point, e.g., waiting for the completion of a global computation phase at a barrier. The few threads still accessing the table at this point will need a lot of time for growing (up to Ω(n)) while most threads are waiting for them. One could try to also enslave waiting threads, but it looks difficult to do this in a sufficiently general and portable way.

Using a Dedicated Thread Pool A provably efficient approach is to maintain a pool of p threads dedicated to growing the table. They are blocked
until a growing operation is triggered. This is when they are awoken to collectively perform the migration in time O(n/p) and then go back to sleep. During a migration, application threads might have to sleep until the migration threads are finished. This will increase the CPU time of our migration threads, making this method nearly as efficient as the enslavement variant. Using a reasonable computation model, one can show that using thread pools for migration increases the cost of each table access by at most a constant in a globally amortized sense (over the non-growing folklore solution). We omit the relatively simple proof.

To remain fair to all competitors, we used exactly as many threads for the thread pool as there were application threads accessing the table. Additionally, each migration thread was bound to a core that was also used by one corresponding application thread.
Marking Moved Elements for Consistency (asynchronous) During the migration, it is important that no element can be changed in the old table after it has been copied to the new table. Otherwise, it would be hard to guarantee that changes are correctly applied to the new table. The easiest solution to this problem is to mark each cell before it is copied. Marking each cell can be done using a CAS operation to set a special marked bit which is stored in the key. In practice, this reduces the possible key space. If this reduction is a problem, see Section 5.6 on how to circumvent it. To ensure that no copied cell can be changed, it suffices to ensure that no marked cell can be changed. This can easily be done by checking the bit before each writing operation, and by using CAS operations for each update. This prohibits the use of fast atomic operations to change element values.
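A sketch of the marking step on the Cell layout from Section 4 (the constant MARK and the helper name are ours):

    #include <cstdint>

    static const uint64_t MARK = 1ull << 63;  // marked bit kept in the key word

    // Sketch: atomically set the marked bit before copying a cell, so that
    // every writer (which checks the bit and updates via CAS) fails on
    // marked cells.
    inline bool markCell(Cell* cell) {
        for (;;) {
            uint64_t k = __atomic_load_n(&cell->key, __ATOMIC_ACQUIRE);
            if (k & MARK) return false;  // someone else already marked it
            if (__sync_bool_compare_and_swap(&cell->key, k, k | MARK))
                return true;             // we own the copy of this cell
        }
    }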
After the migration, the old hash table has to be deallocated. Before deallocating an old table, we have to make sure that no thread is currently using it anymore. This problem can generally be solved by using reference counting. Instead of storing the table with a usual pointer, we use a reference-counted pointer (e.g., std::shared_ptr) to ensure that the table is eventually freed.

The main disadvantage of counting pointers is that acquiring a counting pointer requires an atomic increment on a shared counter. Therefore, it is not feasible to acquire a counting pointer for each operation. Instead, a copy of the shared pointer can be stored locally, together with the increasing version number of the corresponding hash table (using the method from Section 5.1). At the beginning of each operation, we can use the local version number to make sure that the local counting pointer still points to the newest table version. If this is not the case, a new pointer will be acquired. This happens only once per version of the hash table. The old table will automatically be freed once every thread has updated its local pointer. Note that counting pointers cannot be exchanged in a lock-free manner, increasing the cost of changing the current table (using a lock). This lock could be avoided by using a hazard pointer; we did not do this.

Prevent Concurrent Updates to Ensure Consistency (synchronized) We propose a simple protocol inspired by read-copy-update protocols [22]. The thread t triggering the growing operation sets a global growing flag using a CAS instruction. A thread t′ performing a table access sets a local busy flag when starting an operation. Then it inspects the growing flag; if that flag is set, the local flag is unset, and t′ waits for the completion of the growing operation or helps with migrating the table, depending on the current growing strategy. Thread t waits until all busy flags have been unset at least once before starting the migration. When the migration is completed, the growing flag is reset, signaling to the waiting threads that they can safely continue their table operations. Because this protocol ensures that no thread is accessing the previous table after the beginning of the migration, the old table can be freed without using reference counting.

We call this method (semi-)synchronized, because grow and update operations are disjoint. Threads participating in one growing step still arrive asynchronously, e.g., when the parent application calls a hash table operation. Compared to the marking-based protocol, we save cost during the migration by avoiding CAS operations. However, this comes at the expense of setting the busy flags for every operation. Our experiments indicate that overall this is only advantageous for updates using atomic operations like fetch-and-add, which cannot coexist with the marker flags.
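A condensed sketch of the accessor side of this protocol (names ours). In a real implementation, the busy flag would live in the globally visible handle from Section 5.1 rather than in a plain thread_local, so that the growing thread can scan all flags:

    #include <atomic>

    std::atomic<bool> growing{false};            // global growing flag
    thread_local std::atomic<bool> busy{false};  // per-thread busy flag (see lead-in)

    void waitOrHelpWithMigration();  // assumed: block or join the migration

    // Sketch: wrap every table operation so that it never runs concurrently
    // with a migration. The seq_cst orderings make flag stores and loads
    // mutually visible between accessors and the growing thread.
    template <typename Operation>
    auto guardedAccess(Operation op) {
        for (;;) {
            busy.store(true, std::memory_order_seq_cst);
            if (!growing.load(std::memory_order_seq_cst))
                break;                     // migration cannot start while we are busy
            busy.store(false, std::memory_order_seq_cst);
            waitOrHelpWithMigration();
        }
        auto result = op();                // operate on the current table version
        busy.store(false, std::memory_order_release);
        return result;
    }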
5.4 Deletions

For concurrent linear probing, we combine tombstoning (see Section 4) with our migration algorithm to clean the table once it is filled with too many tombstones.

A tombstone is an element that has a del_key in place of its key. The key x of a deleted entry ⟨x,a⟩ is atomically changed to ⟨del_key,a⟩. Other table operations scan over these deleted elements like over any other nonempty entry. No inconsistencies can arise from deletions. In particular, a concurrent find-operation with a torn read will return the element before the deletion, since the delete-operation leaves the value slot a untouched. A concurrent insert ⟨x,b⟩ might read the key x before it is overwritten by the deletion and return false because it concludes that an element with key x is already present. This is consistent with the outcome when the insertion is performed before the deletion in a linearization.

This method of deletion can easily be implemented in the folklore solution from Section 4. But the starting capacity has to be set depending on the number of overall insertions, since this form of deletion does not free up any of the deleted cells. Even worse, tombstones will fill up the table and slow down find queries.

Both of these problems can be solved by migrating all non-tombstone elements into a new table. The decision when to migrate the table should be made solely based on the number of insertions I (= the number of nonempty cells). The count of all non-deleted elements I − D is then used to decide whether the table should grow, keep the same size (notice that γ = 1 is a special case for our optimized migration), or shrink. Either way, all tombstones can be removed in the course of the element migration.
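A sketch of such a tombstoning deletion, reusing the Cell and hash() sketches from Section 4 (names ours). The data word is deliberately left untouched, matching the torn-read argument above:

    #include <cstddef>
    #include <cstdint>

    static const uint64_t del_key = ~0ull;  // reserved tombstone key (see Section 5.6)

    // Sketch: delete by atomically replacing the key word with del_key.
    // The data word stays as it was, so a torn read still returns the
    // element exactly as it existed before the deletion.
    bool remove(Cell* table, size_t c, uint64_t k) {
        for (size_t i = hash(k) % c; ; i = (i + 1) % c) {
            uint64_t key = __atomic_load_n(&table[i].key, __ATOMIC_ACQUIRE);
            if (key == empty_key) return false;  // key was never inserted
            if (key == k)
                return __sync_bool_compare_and_swap(&table[i].key, k, del_key);
        }
    }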
5.5 Bulk Operations

Building a hash table for n elements passed to the constructor can be parallelized using integer sorting by the hash function value. This works in time O(n/p) regardless of how many times an element is inserted, i.e., sorting circumvents contention. See the work of Müller et al. [25] for a discussion of this phenomenon in the context of aggregation.

This can be generalized for processing batches of size m = Ω(n) that may even contain a mix of insertions, deletions, and updates. We outline a simple algorithm for bulk insertion that works without explicit sorting, albeit one that does not avoid contention: Let a denote the old size of the hash table and b the number of insertions. Then a+b is an upper bound for the new table size. If necessary, grow the table to that size or larger (see below). Finally, in parallel, insert the new elements.

More generally, processing batches of size m = Ω(n) in a globally synchronized way can use the same strategy. We outline it for the case of bulk insertions; generalization to deletions, updates, or mixed batches is possible: Integer sort the elements to be inserted by their hash key in expected time O(m/p). Among elements with the same hash value, remove all but the last. Then "merge" the batch and the hash table into a new hash table (that may have to be larger to provide space for the new elements). We can adapt ideas from parallel merging [11]. We co-partition the sorted insertion array and the hash table into corresponding pieces of size O(m/p). Most of the work can now be done on these pieces in an embarrassingly parallel way – each piece of the insertion array is scanned sequentially by one thread. Consider an element ⟨x,a⟩ and the previous insertion position i in the table. Then we start looking for a free cell at position max(h(x),i).

5.6 Restoring the Full Key Space

Our table uses special keys, like the empty key (empty_key) and the deleted key (del_key). Elements that actually have these keys cannot be stored in the hash table. This can easily be fixed by using two special slots in the global hash table data structure. This makes some case distinctions necessary, but should have rather low impact on the overall performance.

One of our growing variants (asynchronous) uses a marker bit in its key field. This halves the possible key space from 2^64 to 2^63. To regain the lost key space, we can store the lost bit implicitly. Instead of using one hash table that holds all elements, we use two subtables t0 and t1. The subtable t0 holds all elements whose key does not have its topmost bit set, while t1 stores all elements whose key does have the topmost bit set – but instead of storing the topmost bit explicitly, it is removed. Each element can still be found in constant time, because when looking for a certain key, it is immediately clear from the key's topmost bit which subtable has to be searched.
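A sketch of this routing (names ours; Table stands for a bounded table from Section 4 that internally consumes the topmost key bit as its marker):

    #include <cstdint>

    // Sketch: two subtables regain the key bit consumed by the marker.
    // The topmost bit of the user key selects the subtable and is stripped
    // before storing, so it is kept implicitly.
    struct WideKeyTable {
        Table t0, t1;  // assumed subtable type with insert(key, data)

        bool insert(uint64_t key, uint64_t data) {
            const uint64_t top = 1ull << 63;
            return (key & top) ? t1.insert(key & ~top, data)  // bit stored implicitly
                               : t0.insert(key, data);
        }
    };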