To Vote Before Decide: A Logless One-Phase Commit Protocol for Highly-Available Datastores

Yuqing Zhu #1, Philip S. Yu △2, Guolei Yi +3, Wenlong Ma #4, Mengying Guo #4, Jianxun Liu #4

# ICT, Chinese Academy of Sciences, Beijing, China
△ University of Illinois at Chicago, USA
+ Baidu, Beijing, China

[email protected], [email protected], [email protected], {mawenlong,guomengying,liujianxun}@ict.ac.cn
Abstract—Highly-available datastores are widely deployed for online applications. However, many online applications are not content with the simple data access interface currently provided by highly-available datastores. Distributed transaction support is demanded by applications such as the large-scale online payment used by Alipay or Paypal. Current solutions to distributed transaction can spend more than half of the whole transaction processing time in distributed commit. An efficient atomic commit protocol is highly desirable. This paper presents the HACommit protocol, a logless one-phase commit protocol for highly-available systems. HACommit has transaction participants vote for a commit before the client decides to commit or abort the transaction; in comparison, the state-of-the-art practice for distributed commit is to have the client decide before participants vote. The change enables the removal of both the participant logging and the coordinator logging steps in the distributed commit process; it also makes it possible that, after the client initiates the transaction commit, the transaction data is visible to other transactions within one communication roundtrip time (i.e., one phase). In the evaluation with extensive experiments, HACommit outperforms recent atomic commit solutions for highly-available datastores under different workloads. In the best case, HACommit can commit in one fifth of the time 2PC does.

Keywords—atomic commit, high availability, transaction, 2PC, consensus

I. INTRODUCTION

Online applications have strong requirements on availability; their data storage widely exploits highly-available datastores [1]–[3]. For highly-available datastores, distributed transaction support is highly desirable. It can simplify application development and facilitate large-scale online transacting business like Paypal [4], Alipay [5] or Baidu Wallet [6]. Besides, it can enable quick responses to big data queries through materialized views and incremental processing [7], [8]. The benefits of transactions come from the ACID (atomicity, consistency, isolation and durability) guarantees [9]. The atomic commit process is key to the guarantee of the ACID properties. Current solutions to atomic commit incur a high cost, inhibiting online applications from using distributed transactions. A fast atomic commit process is highly desirable.

The state-of-the-art practice for distributed commit in highly-available datastores is to have the transaction client decide before the transaction participants vote [10]–[15], denoted as the vote-after-decide approach. On deciding, the client initiates a distributed commit process, which typically incurs two phases of processing. Participants vote on the decision in the first phase of commit, with the votes recorded in logs or through replication. The second phase is for notifying the commit outcome and applying transaction changes. Even if the transaction client can be notified of the commit outcome at the end of the first phase [10], [11], [14], [15], the commit is not completed and the transaction result is not visible to other transactions until the end of the second phase. The two processing phases involve at least two communication roundtrips, as well as the step of logging to write-ahead logs [16] or of replicating among servers. The communication roundtrips and the logging or replicating step are costly procedures in distributed processing. They lead to a long distributed commit process, which then reduces transaction throughputs.

A different approach to distributed commit is to have participants vote for a commit before the client decides to commit or abort the transaction, denoted as the vote-before-decide approach. With the participants voting first, the voting step can overlap with the processing of the last transaction operation, saving one communication roundtrip; and the votes can be replicated in the meantime, instead of in a separate processing step. This makes the removal of one processing phase possible. On receiving the client's commit decision, the participants can directly commit the transaction locally; thus, the transaction data can be made visible to other transactions within one communication roundtrip time, i.e., one phase. Though previous one-phase commit protocols also have participants vote early, they need to make several impractical assumptions, e.g., log externalization [17]; besides, they rely heavily on the coordinator logs to guarantee atomicity and durability.

In this paper, we present the HACommit protocol, a logless one-phase commit protocol for highly-available systems. HACommit takes the vote-before-decide approach.
In order to remove logging and enable one-phase commit, HACommit tackles two key challenges: the first is how to commit (or abort) a transaction correctly in a one-phase process; and the second is how to guarantee correct transaction recovery on participant or coordinator failures without using logs.

For the first challenge, we observe that, with the vote-before-decide approach, the commit process becomes a problem in which the client proposes a decision to be accepted by participants. This problem is widely known as the consensus problem [18]. Consensus algorithms are solutions to the consensus problem. The widely used consensus algorithm Paxos [19] can reach a consensus among participants (acceptors) in a one-phase process if the proposer is the initial proposer in a run of the algorithm. HACommit runs the Paxos algorithm once for each transaction commit (or abort). It uses the unique client as the initial proposer of the algorithm and the participants as the acceptors and learners. Thus, the client can propose any value, either commit or abort, to be accepted by participants as the consensus. HACommit proposes a new procedure for processing the last transaction operation such that consensus algorithms can be exploited in the commit process. To exploit Paxos, HACommit designs a transaction context structure to keep Paxos configuration information for the commit process.

For the second challenge, we notice that consensus algorithms can reach an agreement among a set of participants safely even on proposer failures. As HACommit exploits Paxos and uses the client as the proposer/coordinator, a client failure will not block the commit process. On a client failure, HACommit runs the classic Paxos algorithm to reach the same transaction outcome among the participants, which act as would-be proposers replacing the failed client. Furthermore, we observe that, in practice, the high availability of data in highly-available datastores has an effect equal to fail-free participants during commit. Instead of using logs for participant failure recovery, HACommit has participants replicate their votes and the transaction metadata to their replicas when processing the last transaction operation. For participant replica failures, HACommit proposes a recovery process that exploits the replicated votes and metadata.

With HACommit, a highly-available datastore can not only respond to the client commit request within one phase, as in other state-of-the-art commit solutions [10], [14], [15], but also make the transaction changes visible to other transactions within one phase, increasing transaction concurrency. Without client failures, HACommit can commit a transaction within two message delays. Based on Paxos, HACommit is non-blocking on client failures; and it can also tolerate participant replica failures. HACommit can be used along with various concurrency control schemes [9], [17], e.g., optimistic, multi-version or lock-based concurrency control. We implemented HACommit and evaluated its performance using a YCSB-based transaction benchmark [20]. As the number of participants and data items involved in a transaction is the key factor affecting the performance of commit protocols, we evaluated HACommit and several recent protocols [12], [14], [15] by varying the number of operations per transaction. In the evaluation with extensive experiments, HACommit can commit in less than a millisecond. In the best case, HACommit can commit in one fifth of the time that the widely used 2PC takes.

Roadmap. Section II discusses related work. Section III overviews the design of HACommit. Section IV details the last operation processing in HACommit and Section V describes the commit process. Section VI presents the recovery processes on client and participant failures. We report our performance evaluations in Section VII. The paper is brought to a close with conclusions in Section VIII.

II. RELATED WORK

Atomic commit protocols (ACPs). A large body of work has studied the atomic commit problem in distributed environments, both in the database community [16], [17], [21] and in the distributed computing community [22], [23]. The most widely used atomic commit protocol is two-phase commit (2PC) [9]. It was proposed decades ago but remains widely exploited in recent years [12], [15], [24]–[26]. 2PC involves at least two communication roundtrips between the transaction coordinator and the participants. Relying on both coordinator and participant logs for fault tolerance, it is blocking on coordinator failures.

Non-blocking atomic commit protocols were proposed to avoid blocking on coordinator failures during commit. But some assume the impractical model of synchronous communication and incur high costs, so they are rarely implemented in real systems [27]. Those assuming the asynchronous system model generally exploit coordinator replication and the fault-tolerant consensus protocol [21], [22]. These non-blocking ACPs generally incur an even higher cost than 2PC. Besides, they are all designed taking the same vote-after-decide approach as 2PC, i.e., participants vote after the client decides.

One-phase commit (1PC) protocols were proposed to reduce the communication costs of 2PC. Compared to 2PC, they reduce both the number of forced log writes and the number of communication roundtrips. The price is to send all participants' logs to the coordinator [28] or to make impractical assumptions on systems, e.g., consistency checking on each update [17]. Non-blocking 1PC protocols also exist. They have the same problems as blocking 1PC protocols. Though 1PC protocols have participants vote for commit before the client decides, as HACommit does, they do not allow the client to abort the transaction if all transaction operations are successfully executed [9]. In comparison, HACommit gives the client full freedom to abort a transaction.
All the above atomic commit protocols do not consider the high availability of data as a condition, thus involving unnecessary logging steps for failure recovery at the participants or the coordinator. Exploiting the high availability of data, the participant logging step can be easily implemented as a process of data replication, which is executed for each operation in highly-available datastores, no matter whether the operation belongs to a transaction or not.

Figure 1. An example commit process using HACommit.

ACPs for highly-available datastores. In recent years, quite a few solutions have been proposed for atomic commit in highly-available datastores. Spanner [12] layers two-phase locking and 2PC over the non-blocking replica synchronization protocol of Paxos [19]. Spanner is non-blocking due to the replication of the coordinator's and participants' logs by Paxos, but it incurs a high cost in commit. Message futures [29] proposes a transaction manager that utilizes a replication log to check transaction conflicts and exchange transaction information across datacenters. The concurrency server for conflict checking is the bottleneck for scalability and performance. Besides, the assumption of shared logs is impractical in real systems [17]. Helios [11] also exploits a log-based commit process. It can guarantee the minimum transaction conflict detection time across datacenters. However, it relies on a conflict detection protocol for optimistic concurrency control using replicated logs, which makes the strong assumption of one replica knowing all transactions of any other replica within a critical time interval, which is impossible for asynchronous systems with disorderly messages [30]. The safety property of Helios in guaranteeing serializability can thus be threatened by the fluctuation of cross-DC communication latencies. These commit proposals heavily exploit transaction logs, while logging is costly for transaction processing [31].

MDCC [14] proposes a commit protocol based on Paxos variants for optimistic concurrency control [9]. MDCC exploits the application server as the proposer in Paxos, while the application server is in fact the transaction client. Though its application server can find out the transaction outcome within one processing phase, the commit process of MDCC is inherently two-phase, i.e., a voting phase followed by a decision-sending phase, and no concurrent accesses are permitted over outstanding options during the commit process. TAPIR [10] has a Paxos-based commit process similar to that of MDCC, but TAPIR can be used with pessimistic concurrency control mechanisms. It also uses the client as the proposer in Paxos. It layers transaction processing over the inconsistent replication of highly-available datastores, and exploits the high availability of data for participant replica recovery. TAPIR also returns the transaction outcome to the client within one processing phase of commit, but the transaction outcome is only visible to other transactions after two phases. It has strong requirements for applications, e.g., pairwise invariant checks and consensus operation result reversal. Replicated commit [15] layers Paxos over 2PC. In essence, it replicates two-phase commit operations among datacenters and uses Paxos to reach consensus on the commit decision. It requires a full replica in each datacenter, which processes transactions independently and in a blocking manner.

All the above ACPs for highly-available datastores take the vote-after-decide approach. In comparison, HACommit exploits the vote-before-decide approach to enable the removal of one processing phase and the removal of logging in commit. HACommit overlaps the participant voting with the processing of the last operation. Using the unique client as the transaction coordinator and the initial Paxos proposer, HACommit commits the transaction in one phase, at the end of which the transaction data is made visible to other transactions. HACommit exploits the high availability of data for failure recovery, instead of using the classic approach of logging.

III. OVERVIEW OF HACOMMIT

HACommit is designed to be used in highly-available datastores, which guarantee high availability of data. Generally, highly-available datastores partition data into shards and distribute them to networked servers to achieve high scalability. To guarantee high availability of data, each shard is replicated across a set of servers. Clients are front-end application servers or any proxy service acting for applications. Clients can communicate with servers of the highly-available datastores. A transaction is initiated by a client. A transaction participant is a server holding any shard operated by the transaction, while servers holding replicas of a shard are called participant replicas.

The implementation of HACommit involves both the client and server sides. On the client side, it provides an atomic commit interface via a client-side library for transaction processing. On the server side, it specifies the processing of the last operation and the normal commit process, as well as the recovery process on client or participant failures. Except for the last operation, all transaction operations can be processed following either the inconsistent replication solutions [10], [14] or the consistent replication solutions [12], [15]. Different concurrency control schemes and isolation levels [9] can be used with HACommit, e.g., optimistic, multi-version or lock-based concurrency control, and read-committed or serializable isolation levels. On processing the last transaction operation, participants vote for a transaction commit based on the results of local concurrency control, integrity and consistency checks.
A HACommit application begins a transaction, starting the transaction execution phase. It can then execute reads and writes in the transaction. On the last operation, the client indicates to all participants that it is the last operation of the transaction. All participants check locally whether to vote YES or NO for a commit. They replicate their votes and the transaction context information to their replicas respectively before responding to the client. The client will receive votes for a commit from all participants, as well as the processing result for the last operation. This is the end of the execution phase. Then, the client can either commit or abort the transaction, though the client can only commit the transaction if all participants vote YES [32].
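For concreteness, the following minimal Python sketch simulates this client-side flow. It is illustrative only: the names (Participant, run_transaction, the ctx dictionary) are our own assumptions for exposition, not HACommit's published API, and the network exchange is reduced to local calls.

from dataclasses import dataclass

@dataclass
class Participant:
    shard: str

    def execute(self, op):
        pass          # ordinary read/write processing in the execution phase

    def execute_last(self, op, ctx):
        # The participant votes after local checks and replicates its vote
        # and the transaction context to its replicas before replying
        # (detailed in Section IV); the vote piggybacks on this response.
        return "YES"

def run_transaction(participants, ops):
    ctx = {"tid": "txn-1", "shard_ids": [p.shard for p in participants]}
    for op in ops[:-1]:
        for p in participants:
            p.execute(op)                     # execution phase
    votes = [p.execute_last(ops[-1], ctx) for p in participants]
    # The client may always abort; it may commit only on unanimous YES votes.
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    return decision                           # sent to participants (Section V)

print(run_transaction([Participant("s1"), Participant("s2")], ["w1", "w2"]))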
The atomic commit process starts when the client proposes the transaction decision to the participants and their replicas. Once the client's decision is received by more than a replica quorum of any participant, HACommit will guarantee that the transaction is committed or aborted according to the client's decision despite failures of the client or the participant replicas. Therefore, the client can safely end the transaction once it has received acknowledgement from a replica quorum of any participant. The transaction will be committed at all participant replicas once they receive the client's commit decision. An example of transaction processing using HACommit is illustrated in Figure 1.

IV. PROCESSING THE LAST OPERATION

On processing the last operation of the transaction, the client sends the last operation to participants holding relevant data, indicating that it is the last operation. For other participants, the client sends an empty operation as the last operation. All participants process the last operation; those receiving an empty operation do no processing. They check locally whether a commit for the transaction would violate any ACID property and vote accordingly. They replicate their votes and the transaction context to their replicas respectively before responding to the client. The replication of participant votes and the transaction context is required to preserve the votes and guarantee voting consistency in case of participant failures. The participants piggyback their votes on their responses to the client's last operation request after the replication. The client makes its decision on receiving responses from all participants.
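A participant-side handler for the (possibly empty) last operation might look as follows. This is a hedged sketch under assumed names (VoteStore, LastOpRequest, ParticipantServer); it complements the client-side view above by showing the empty-operation case and the replicate-before-respond rule.

from dataclasses import dataclass, field

@dataclass
class VoteStore:                     # stands in for a participant replica
    log: dict = field(default_factory=dict)
    def store(self, tid, vote, ctx):
        self.log[tid] = (vote, ctx)

@dataclass
class LastOpRequest:
    tid: str
    op: object                       # None encodes the empty last operation
    context: dict                    # transaction ID, shard IDs, ...

class ParticipantServer:
    def __init__(self, replicas):
        self.replicas = replicas
        self.votes = {}

    def checks_pass(self, tid):
        return True                  # concurrency/integrity checks (stubbed)

    def handle_last_op(self, req):
        result = None
        if req.op is not None:       # empty operations need no processing
            result = "executed"      # run the real operation here (stubbed)
        vote = "YES" if self.checks_pass(req.tid) else "NO"
        self.votes[req.tid] = vote
        # Replicate vote + context *before* responding, so the vote survives
        # a participant failure and stays consistent across replicas.
        for r in self.replicas:
            r.store(req.tid, vote, req.context)
        return {"tid": req.tid, "vote": vote, "result": result}

server = ParticipantServer([VoteStore()])
print(server.handle_last_op(LastOpRequest("txn-1", {"put": ("k", "v")}, {"shard_ids": ["s1"]})))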
Transaction context. The transaction context must include the transaction ID and the shard IDs. The transaction ID uniquely identifies the transaction and distinguishes the Paxos instance for the commit process. The shard IDs are necessary to compute the set of participant IDs, which constitute the configuration information of the Paxos instance for commit. This configuration information must be known to all acceptors of Paxos.

In case inconsistent replication [10] is used in operation processing, the transaction context must also include relevant writes. Relevant writes are writes operating on data held by a participant and its replicas. The relevant writes are necessary in case of participant failures. With inconsistent replication, participant replicas might not process the same writes for a transaction as the participant. Consider when a set of relevant writes is known to the participant but not to its replicas. The client might fail after sending the Commit decision to participants. In the meantime, a participant fails and one of its replicas acts as the new participant. Then, the recovery proposers propose the same Commit decision. In such a case, the new participant will not know what writes to apply when committing the transaction. To reduce the data kept in the transaction context, the relevant writes can be recorded as commands [33].
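Putting the above together, the transaction context can be pictured as a small record. The field names below are assumed for illustration; the paper does not fix a concrete layout.

from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class TxnContext:
    # Uniquely identifies the transaction and its commit Paxos instance;
    # can be generated in a distributed way, e.g., with a UUID.
    tid: str = field(default_factory=lambda: str(uuid4()))
    # Shard IDs determine the participant set, i.e., the Paxos configuration.
    shard_ids: frozenset = frozenset()
    # Only needed under inconsistent replication: the relevant writes (or the
    # commands that produce them) that a recovering replica must apply.
    relevant_writes: tuple = ()

print(TxnContext(shard_ids=frozenset({"s1", "s2"})))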
V. THE COMMIT PROCESS

In HACommit, the client commits or aborts a transaction by initiating a Paxos instance.

A. Background: the Paxos Algorithm

A run of the Paxos algorithm is called an instance. A Paxos instance reaches a single consensus among the participants. An instance proceeds in rounds. Each round has a ballot with a unique number bid. Any would-be proposer can start a new round on any (apparent) failure. Each round generally consists of two phases [34] (phase-1 and phase-2), and each phase involves one communication roundtrip. The consensus is reached when one active proposer successfully finishes one round. Participants in the consensus problem are generally called acceptors in Paxos. However, in an instance of Paxos, if a proposer is the only and the initial proposer, it can propose any value to be accepted by participants as the consensus, incurring one communication roundtrip between the proposer and the participants [35].

Paxos is commonly used in reaching the consensus among a set of replicas. Each Paxos instance has a configuration, which includes the set of acceptors and learners. Widely used in reaching replica consensus, Paxos is generally used with its configuration staying the same across instances [36]. The configuration information must be known to all proposers, acceptors and learners. Take data replication for example. The set of data replicas are acceptors and learners. The leader replica is the initial proposer and all other replicas are would-be proposers. Clients send their write requests to the leader replica, which picks one write or a write sequence as its proposal. Then the leader replica starts a Paxos instance to propose its proposal to the acceptors. In practice, the configuration can stay the same across different Paxos instances, e.g., for writes to the same data at different times.

B. The One-Phase Commit Process

In HACommit, the client is the only and the initial proposer of the Paxos instance, as each transaction has a unique client. As a result, the client can commit the transaction in one communication roundtrip to the participants.

The commit process starts from the second phase (phase-2) of the Paxos algorithm. That is, the client first sends a phase-2 message to all participants. To guarantee correctness, the exploitation of the Paxos algorithm must strictly comply with the algorithm specification. Complying with the Paxos algorithm, the phase-2 message includes a ballot number bid, which is equal to zero, and the proposal for commit, which can be commit or abort. On receiving the phase-2 message, a participant records the ballot number and the outcome for the transaction locally. Then it commits the transaction by applying the writes and releasing all data items; or, it aborts the transaction by rolling back the transaction and releasing all data items. In the meantime, the participant invokes the replication layer to replicate the result to its replicas. Afterwards, each participant acknowledges the client. Alternatively, the client can send the phase-2 message to all participants and their replicas. Each participant replica then follows the same processing procedure as its participant's. In that case, the client waits for responses from all participants and their replicas.
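The exchange can be condensed into the following self-contained sketch. The message fields mirror the description above (ballot bid = 0, a commit/abort proposal, the transaction ID and the configuration), while the class and function names are our own illustrative choices and the network is simulated by local calls.

from dataclasses import dataclass

@dataclass
class Phase2Msg:
    tid: str
    bid: int         # ballot number; 0 marks the initial proposer (the client)
    decision: str    # "COMMIT" or "ABORT"
    config: dict     # Paxos configuration, derived from the shard IDs

class CommitAcceptor:
    def __init__(self):
        self.promised = {}    # tid -> highest ballot promised
        self.accepted = {}    # tid -> (ballot, decision)

    def on_phase2(self, msg):
        if msg.bid < self.promised.get(msg.tid, 0):
            return None                    # stale proposer; ignore
        self.accepted[msg.tid] = (msg.bid, msg.decision)
        # On COMMIT: apply writes and release data items; on ABORT: roll
        # back. Either way, replicate the outcome (storage calls stubbed).
        return ("ACK", msg.tid)

def client_commit(participants, tid, decision, config):
    msg = Phase2Msg(tid, 0, decision, config)      # skip phase 1 entirely
    acks = [p.on_phase2(msg) for p in participants]
    return all(a is not None for a in acks)        # one roundtrip in total

print(client_commit([CommitAcceptor()], "txn-1", "COMMIT", {"s1": ["n1"]}))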
C. Participant Acknowledgements

For any participant, if the acknowledgements by a quorum of its replicas are received by the client, the client can safely end the transaction. In fact, the commit process is not finished until all participants acknowledge the client. But any participant failing to acknowledge can go through the failure recovery process (Section VI) to successfully finish the commit process. In HACommit, all participants must finally acknowledge the acceptance of the client's proposal so that the transaction is committed at all data operated by the transaction.

The requirement for participants' acknowledgements is different from that for the quorum acceptance in the original Paxos algorithm. In Paxos, the consensus is reached if a proposal is accepted by more than a quorum of participants. The original Paxos algorithm can tolerate the failures of both participants (acceptors) and proposers. HACommit uses the client as the initial proposer and the participants as acceptors and would-be proposers when exploiting Paxos for the commit process. In its Paxos exploitation, HACommit only tolerates the failures of the initial proposer and the would-be proposers. However, the failure of participants (i.e., acceptors) can be tolerated by the participant replication, which can also exploit consensus algorithms like Paxos.

D. Distinguishing Concurrent Commits

Each Paxos instance corresponds to the commit of one transaction, but one participant can engage in multiple Paxos instances for commit, as the participant can be involved in multiple concurrent transactions. To distinguish different transactions, we include a transaction ID in the phase-2 message, as well as in all messages sent between clients and participants. A transaction T is uniquely identified in the system by its ID tid, which can be generated using distributed methods, e.g., UUID [37].

E. Paxos Configuration Information

Different from those Paxos exploitations where the configuration stays the same across multiple instances, HACommit has different configurations in Paxos instances for different transaction commits. The set of participants is the configuration of a Paxos instance. Each transaction has different participants, leading to different configurations of Paxos instances for commit. As required by the algorithm, the configuration must be known to all proposers and acceptors within the configuration. A replacing proposer (i.e., a recovery node) needs the configuration information to continue the algorithm after the failure of a previous proposer. The first proposer of the commit instance is the transaction client, which is the only node with complete information of the configuration. If the client fails, the configuration information might get lost. In fact, a client might fail before the transaction comes to the commit step. Then a replacing proposer will hardly have enough configuration information to abort the dangling transaction.

To guarantee the availability of the configuration information, we include the configuration information in the phase-2 message. Besides, as the configuration expands and updates after each new operation is processed, the client must send the up-to-date configuration to all participants contacted so far on processing each operation. In case a participant fails and one of its replicas takes its place, the configuration must be updated and sent to all replicas of all participants. The exact configuration of the Paxos instance for commit is formed right on the processing of the last transactional operation. In this way, each participant replica keeps locally an up-to-date copy of the configuration information. As a participant can fail and be replaced by its replicas, HACommit does not rely on participant IDs for the configuration reference. Instead, it records the IDs of all shards operated by the transaction. With the set of shard IDs, any server in the system can find out the contemporary set of participants easily.
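For illustration, resolving the contemporary participant set from the recorded shard IDs could look like this, assuming some shard_locator mapping maintained by the datastore (the mapping itself is not specified by the paper):

def configuration_from_shards(shard_ids, shard_locator):
    # shard_locator maps a shard ID to the servers currently holding it,
    # leader (participant) first, then replicas; any server can evaluate it.
    config = {}
    for sid in sorted(shard_ids):
        servers = shard_locator(sid)
        config[sid] = {"participant": servers[0], "replicas": servers[1:]}
    return config

# Purely illustrative locator:
print(configuration_from_shards(
    {"s1", "s2"},
    lambda sid: [sid + "-leader", sid + "-r1", sid + "-r2"]))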
VI. FAILURE RECOVERY

In the design of HACommit, we assume that, if a client or a participant replica fails, it can only fail by crashing. In the following, we describe the recovery mechanisms for client failure and participant replica failure respectively.

A. On Client Failure

In HACommit, all participants are candidates for recovering nodes on a failure. We call the recovering nodes recovery proposers; they act as would-be proposers of the commit process. The recovery proposers will be activated on client failure. In an asynchronous system, there is no way to be sure whether a client has actually failed. In practical implementations, a participant can keep a timer on the duration since it last received a message from the current proposer. If the duration exceeds a threshold, the participant considers the current proposer failed. Then it considers itself the recovery proposer.
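A minimal sketch of this timer-based suspicion follows, with the threshold as a tunable parameter (the 15-second default matches the timeout used in the evaluation of Section VII; the class name is our own):

import time

class ProposerMonitor:
    def __init__(self, threshold_s=15.0):
        self.threshold_s = threshold_s
        self.last_heard = time.monotonic()

    def on_message_from_proposer(self):
        self.last_heard = time.monotonic()   # reset on every proposer message

    def suspects_failure(self):
        # In an asynchronous system this is only a suspicion, never certainty;
        # a false suspicion is safe and merely triggers a new Paxos round.
        return time.monotonic() - self.last_heard > self.threshold_s

monitor = ProposerMonitor()
print(monitor.suspects_failure())   # False right after a message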
A recovery proposer must run the complete Paxos algorithm to reach the consensus safely among the participants. As any would-be proposer can start a new round on any (apparent) failure, multiple rounds, phases and communication roundtrips can be involved on client failures.

Although complicated situations can happen, the participants of a transaction will reach the same outcome eventually, if they ever reach a consensus and the transaction ends. For example, as delayed messages cannot be distinguished from failures in an asynchronous system, the current proposer might in fact not have failed. Instead, its last message might not have reached a participant, which then considers the proposer failed. Or, multiple participants consider the current proposer failed and start a new round of Paxos simultaneously. None of these situations impairs the safety of the Paxos algorithm [19].

1) The Recovery Process: A recovery proposer starts the recovery process by starting a new round of the Paxos instance from the first phase. In the first phase, the new proposer updates the ballot number bid to be larger than any it has seen. It sends a phase-1 message with the new ballot number to all participants. On receiving the phase-1 message with bid, if a participant has never received any phase-1 message with a ballot number greater than bid, it responds to the proposer. The response includes the accepted transaction decision and the ballot number at which the acceptance was made, if the participant has ever accepted any transaction decision.

If the proposer has received responses to its phase-1 message from all participants, it sends a phase-2 message to all participants. The phase-2 message has the same ballot number as the proposer's last phase-1 message. Besides, the transaction outcome with the highest ballot number in the responses is proposed as the final transaction outcome; or, if no accepted transaction outcome is included in the responses to the phase-1 message, the proposer proposes ABORT to satisfy the assumptions of the CAC problem. Unless the participant has already responded to a phase-1 message having a ballot number greater than bid, a participant accepts the transaction outcome and ends the transaction after receiving the phase-2 message. The participant acknowledges the proposer accordingly. After receiving acknowledgements from all participants, the new proposer can safely end the transaction.
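The recovery round just described can be condensed into the following self-contained sketch. It follows the phase-1/phase-2 logic above (higher ballot, adopt the highest-ballot accepted outcome, otherwise propose ABORT); the data structures are assumed for illustration and the exchange is simulated locally.

class RecoveryAcceptor:
    def __init__(self):
        self.promised = 0
        self.accepted = None              # (ballot, decision) or None

    def on_phase1(self, bid):
        if bid <= self.promised:
            return None                   # reject: promised a higher ballot
        self.promised = bid
        return ("PROMISE", self.accepted)

    def on_phase2(self, bid, decision):
        if bid < self.promised:
            return False                  # preempted by a newer round
        self.accepted = (bid, decision)   # accept and end the transaction
        return True

def recover(participants, highest_seen_bid):
    bid = highest_seen_bid + 1            # ballot larger than any seen
    replies = [p.on_phase1(bid) for p in participants]
    if any(r is None for r in replies):
        return None                       # retry later with a larger ballot
    accepted = [r[1] for r in replies if r[1] is not None]
    # Propose the outcome accepted at the highest ballot; otherwise ABORT.
    decision = max(accepted)[1] if accepted else "ABORT"
    if all(p.on_phase2(bid, decision) for p in participants):
        return decision                   # all participants acknowledged
    return None

print(recover([RecoveryAcceptor(), RecoveryAcceptor()], 0))
# -> 'ABORT' here, since no outcome was previously accepted by any acceptor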
2) Liveness: To guarantee liveness, HACommit adopts the assumption commonly made for Paxos. That is, one proposer will finally succeed in finishing one round of the algorithm. In HACommit, if all participants consider the current proposer failed and start a new round of Paxos simultaneously, a racing condition among the new proposers can form in the first phase of Paxos. No proposer might be able to succeed in finishing the second phase of Paxos, leaving the liveness of commit unguaranteed. Though it rarely happens, the racing condition among would-be proposers must be avoided in Paxos [19] for the liveness consideration. In actual implementations, random back-off of candidates is enough to resolve the racing situation [34], [36]; or, leader election [34] or failure detection [38] services outside the algorithm implementation might be used.

B. On Participant Replica Failures

HACommit can tolerate not only client failures, but also participant replica failures. It can guarantee continuous data availability if more than a quorum of replicas are accessible for each participant in a transaction. In case quorum replica availability cannot be guaranteed, HACommit can be blocked, but the correctness of atomic commit is guaranteed anyhow [19]. The high availability of data enables a recovery process based on replicas instead of logging, though logging and other mechanisms like checkpointing [26] and asynchronous logging [33] can speed up the recovery process. Failed participant replicas can be recovered by copying data from the correct replicas of the same participant. Or, recovery techniques used in consensus and replication services [39], [40] can be employed for the replica recovery of participants. Although one replica is selected as the leader (i.e., the participant), the leader replica can easily be replaced by other replicas of the same participant [39]. If a participant failed before sending its vote to its replicas, the new leader will make a new decision for the vote. Otherwise, as the vote of a participant is replicated before being sent to the coordinator, the vote can be kept consistent during the change of leaders. Besides, the client has sent the transaction outcome to all participants and their replicas in the commit process. Thus, failed participant replicas can be recovered correctly as long as the number of failed replicas for a participant is tolerable by the consensus algorithm in use.

We assume there are fewer failed replicas for each participant than is tolerable by the highly-available datastore. This is generally satisfied, as the number of replicas can be increased to tolerate more failures. If unfortunately the assumption is not met, the participant without enough replicas will not respond to the client, so as to guarantee replica consistency and correctness. The commit process will have to be paused until all participants have enough active replicas. Though not meeting the assumption can impair the liveness of the protocol, HACommit guarantees the correctness of commit and the consistency of data anyhow.

VII. EVALUATION

Our evaluation explores three aspects: (1) the commit performance of HACommit, which has smaller commit latency than other protocols, an advantage that increases as the number of participants per transaction increases; (2) the fault tolerance of HACommit, which can tolerate client failures as well as server failures; and (3) the transaction processing performance of HACommit, which has higher throughputs and lower average latencies than other protocols.

A. Experimental Setup

We compare HACommit with two-phase commit (2PC), replicated commit (RCommit) [15] and MDCC [14]. Two-phase commit (2PC) is still considered the standard protocol for committing distributed transactions. It assumes no replication and is not resilient to single node failures. RCommit and MDCC are state-of-the-art commit protocols for distributed transactions over replicated data, as HACommit is. RCommit has better performance than the approach that layers 2PC over the Paxos algorithm [12], [36]. MDCC guarantees only isolation levels weaker than serializability. The same concurrency control scheme and the same storage management component are used for HACommit, 2PC and RCommit. These three implementations use the consistency level of serializability. Compared to the implementations for 2PC and RCommit, the HACommit implementation also supports the weak isolation level of read committed [41]. The evaluation of MDCC is based on its open sources [42].

We evaluate all implementations using the Amazon EC2 cloud. We evaluate each implementation using a YCSB-based benchmark [20]. As our database completely resides in memory and the network communication plays an important role, we deploy the systems over memory-optimized instances of r3.2xlarge (with 8 cores, 60GB memory and high-speed network). Unless noted otherwise, all implementations are deployed over eight nodes. The cross-node communication roundtrip is about 0.1 milliseconds. For HACommit, RCommit and MDCC, the database is deployed with three replicas. For 2PC, no replication is used. Generally, 2PC requires buffer management for durability. We do not include one for 2PC; an in-memory database is used instead. Durability is guaranteed through operation logging. As buffer management takes up about one fifth of the local processing time of transactions, our 2PC implementation without buffer management should perform faster than a typical 2PC implementation.

In all experiments, each server runs a server program and a test client program. By default, each client runs with 10 threads. Each data point in the graphs represents the median of at least five trials. Each trial is run for over 120s, with the first and last quarter of each trial elided to avoid start-up and cool-down artifacts. For all experimental runs, clients recorded throughput and response times. We report the average of three sixty-second trials.

In all experiments, we preload a database containing a single table with 10 million records. Each record has a single primary key column and 1 additional column with 10 bytes of randomly generated string data. We use small-size records to focus our attention on the key performance factors. Accesses to records are uniformly distributed over the whole database. In all workloads, transactions are committed if no data conflicts exist. That is, all transaction aborts in the experiments are due to concurrency control requirements.

Figure 2. Commit latencies when increasing the number of operations per transaction.

B. Commit Performance

As we are targeting transaction commit protocols, we first examine the actual costs of the commit process. We study the duration of the commit process. We do not compare the commit process of HACommit with that of MDCC because the latter integrates a concurrency control process; comparing only the commit process of the two protocols would be unfair to MDCC.

HACommit outperforms 2PC and RCommit in varied workloads. Figure 2 shows the latencies of commit. We vary the number of operations per transaction from 1 to 64. The advantage of HACommit increases as the number of operations per transaction increases. When a transaction has 64 operations, HACommit can commit in one fifth of the time 2PC does. This performance is more significant than it seems, as HACommit uses replication and 2PC does not. That means HACommit has n−1 times more participants than 2PC in the commit, where n is the number of replicas.

HACommit's commit latency increases slightly as the number of operations increases to 20. On committing a transaction, the system must apply all writes and release all locks. When the number of operations is small, applying writes and releasing locks in the in-memory database costs a small amount of time, compared to the network communication roundtrip time (RTT).
As the number of operations increases, the time needed to apply all writes in the in-memory database increases slightly. Accordingly, the commit latency of HACommit increases.

2PC and RCommit have increased commit latencies when the number of operations per transaction increases. They need to log the new writes for commit and the old values of data items for rollback; thus the time needed for the prepare phase increases as the number of writes goes up, leading to a longer commit process. 2PC has a higher commit latency than RCommit because, in our implementations, 2PC must log in-memory data while RCommit relies on replication for fault tolerance.

Figure 3. Transaction latency variations during server failures.

Figure 4. Transaction throughput variations during server failures.

Figure 5. HACommit's behavior on a client failure (circled numbers are transactions).

C. Fault-Tolerance

In the fault-tolerance tests, we examine the behaviors of HACommit under both client failures and server failures. The evaluation result demonstrates that no transaction is blocked under server failures and the client failure, as long as a quorum of participant replicas are accessible.

We use five replicas and initiate one client in the fault tolerance tests. To simulate failures, we actively kill a process in the experiments. The network module of our implementations can instantly return an error in such a case. Our implementation processes the error as if a connection timeout on node failure had happened.

Figure 3 shows the evolution of the average transaction latency in a five-replica setup that experiences the failure of one replica at 50, 100 and 180 seconds respectively. The corresponding throughputs are shown in Figure 4. The latencies and throughputs are captured for every second. At 50 and 100 seconds, the average transaction latency decreases and the throughput increases. With PCC, reads in the HACommit implementation take up a great portion of time. The failure of one replica means that the system can process fewer reads. Hence, this leads to lower average latencies and higher throughputs for read transactions, as well as for all transactions. At 180 seconds, we fail one more replica, violating the quorum availability assumption of HACommit. The throughput drops to zero immediately because no operation or commit process can succeed at all. The HACommit implementation uses timeouts to detect failures, and quorum reads/writes. As long as a quorum of replicas are available for every data item, HACommit can process transactions normally.

We also examine how HACommit behaves under transaction client failures. We deliberately kill the client in an experiment. Each server program periodically checks its local transaction contexts to see if any last contact time exceeds a timeout period. We set the timeout period to be 15 seconds. Figure 5 visualizes the logs on client failures and demonstrates how participants recover from the client failure.

In Figure 5, replicas represent participants. The cross at the client line represents the failure of the client. The circled numbers represent unended transactions. The time axis at the bottom stretches from left to right. The moment when the first transaction is detected to be unended is denoted as the beginning of the time axis. A transaction is numbered by the time at which it is discovered to be unended, i.e., transaction 1 is the first transaction detected to be unended. A replica can detect that a transaction is unended because there are timeouts on the last time a processing message was received at the replica. Timeouts are indicated by arrows.

Replica 1 has the smallest node ID. It detects unended transactions 1 to 9 and starts pushing them to an end through a repairing process. We have synchronized the clocks of the nodes. For simplicity, we use the moment when replica 1 detects that transaction 1 has a last contact time exceeding the timeout period as the beginning of the time axis. It takes about 100 milliseconds for replica 1 to repair each transaction. Replica 1 aborts the nine transactions in the repairing process because no transaction outcome has ever been accepted by any replica. Transaction 10 is later detected by replica 4. However, replica 4 waits for four timeout periods before it actually initiates a repairing process for the transaction. The reason is that transaction 10 has been committed at replicas 1, 2 and 3. Replica 4 finally commits the transaction. Replica 5 also detects transaction 10, but it does not initiate any repairing process. Before replica 5 starts repairing transaction 10, the transaction is already committed in the repairing process initiated by replica 4.
Figure 6. Transaction throughput: HACommit vs. RCommit.

Figure 7. Transaction average latency: HACommit vs. RCommit.

Figure 8. Latency of update transactions: HACommit vs. RCommit.

Figure 9. Transaction throughput under read-committed CC: HACommit vs. MDCC.

Figure 10. Latency of UPDATE transactions under read-committed CC: HACommit vs. MDCC.

Figure 11. Latency of READ transactions under read-committed CC: HACommit vs. MDCC.

D. Transaction Throughput and Latency

We evaluate the transaction throughput and latency when using different commit protocols. In the experiments, on the failure of lock acquisition, we retry the same transaction until it successfully commits. Each retry is made after a random amount of time.

Figure 6 shows the transaction throughputs when using HACommit and RCommit, and Figure 7 demonstrates the average transaction latencies. The HACommit implementation has larger transaction throughputs than the RCommit implementation in all workloads. Besides, HACommit has lower transaction latencies than RCommit in all workloads. HACommit's advantage in transaction latency increases as the number of operations in a transaction increases in the workloads. As both implementations use the same concurrency control and isolation level, the factors leading to HACommit's advantage over RCommit are two-fold. First, no costly logging is involved during the commit. Second, no persistence of data is needed.

We compare the update transaction latencies of HACommit and RCommit in Figure 8. Both implementations use the same concurrency control scheme and consistency level. We can see that HACommit outperforms RCommit. As the number of operations increases in the workloads, HACommit's advantage also increases. The advantage of HACommit is still due to a commit without logging and data persistence.

We also examine the transaction throughput and latency when using weaker isolation levels with HACommit. In this case, we compare HACommit against MDCC. We implemented HACommit-RC with the read-committed isolation level [41]. This is an isolation level comparable to that guaranteed by MDCC. HACommit-RC differs from HACommit in that it acquires no locks on reads. Figure 9 shows the transaction throughputs for HACommit-RC and MDCC. The latencies of update transactions and read transactions are shown in Figure 10 and Figure 11. HACommit-RC has larger transaction throughputs than MDCC in all workloads. The latencies of update transactions are lower in the HACommit-RC implementation than in the MDCC implementation, although they have similar performance for read transactions. Both HACommit-RC and MDCC implement read transactions similarly and guarantee the read-committed consistency level. The reason that HACommit-RC has better performance in transaction throughput and update transaction latency is as follows. MDCC uses optimistic concurrency control, which can cause high abort rates under high contention, leading to lower performance than HACommit-RC, which uses pessimistic concurrency control. Besides, MDCC holds data by outstanding options, leading to the same effect as locking in committed transactions.

VIII. CONCLUSION

We have proposed HACommit, a logless one-phase commit protocol for highly-available datastores. In contrast to the classic vote-after-decide approach to distributed commit, HACommit adopts the vote-before-decide approach. In HACommit, the procedure for processing the last transaction operation is redesigned to overlap the last operation processing and the voting process. To commit a transaction in one phase, HACommit exploits Paxos and uses the unique client as the initial proposer.
To exploit Paxos, HACommit designs a transaction context structure to keep Paxos configuration information. Although client failures can be tolerated by the Paxos exploitation, HACommit designs a recovery process for client failures such that the transaction can actually end with the transaction data visible to other transactions. For participant replica failures, HACommit has participants replicate their votes and the transaction metadata to their replicas; and a failure recovery process is proposed to exploit the replicated votes and metadata. Our evaluation demonstrates that HACommit outperforms recent atomic commit solutions for highly-available datastores.

ACKNOWLEDGMENT

This work is also supported in part by the State Key Development Program for Basic Research of China (Grant No. 2014CB340402) and the National Natural Science Foundation of China (Grant No. 61303054).

REFERENCES

[1] J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte, "F1: A distributed SQL database that scales," Proc. VLDB Endow., vol. 6, no. 11, pp. 1068–1079, Aug. 2013.

[2] "Amazon cloud goes down Friday night, taking Netflix, Instagram and Pinterest with it," October 2012, http://www.forbes.com/sites/anthonykosner/2012/06/30/amazon-cloud-goes-down-friday-night-taking-netflix-instagram-and-pinterest-with-it/.

[3] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab et al., "Scaling memcache at Facebook," in NSDI, vol. 13, 2013, pp. 385–398.

[4] "Paypal," https://www.paypal.com/.

[5] "Alipay," https://www.alipay.com/.

[6] "Baidu Wallet," https://www.baifubao.com/.

[7] D. Peng and F. Dabek, "Large-scale incremental processing using distributed transactions and notifications," in OSDI, vol. 10, 2010, pp. 1–15.

[8] J. Goldstein and P.-Å. Larson, "Optimizing queries using materialized views: a practical, scalable solution," in ACM SIGMOD Record, vol. 30, no. 2. ACM, 2001, pp. 331–342.

[9] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, New York, 1987, vol. 370.

[10] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. Ports, "Building consistent transactions with inconsistent replication," in Proceedings of SOSP '15. New York, NY, USA: ACM, 2015.

[11] F. Nawab, V. Arora, D. Agrawal, and A. El Abbadi, "Minimizing commit latency of transactions in geo-replicated data stores," in Proceedings of SIGMOD '15. ACM, 2015, pp. 1279–1294.

[12] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild et al., "Spanner: Google's globally-distributed database," in Proceedings of OSDI, p. 1, 2012.

[13] L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson, "Scalable consistency in Scatter," in Proceedings of SOSP. ACM, 2011, pp. 15–28.

[14] T. Kraska, G. Pang, M. J. Franklin, and S. Madden, "MDCC: Multi-data center consistency," in EuroSys, 2013.

[15] H. A. Mahmoud, A. Pucher, F. Nawab, D. Agrawal, and A. E. Abbadi, "Low latency multi-datacenter databases using replicated commits," in Proc. of the VLDB Endowment, 2013.

[16] C. Mohan, B. Lindsay, and R. Obermarck, "Transaction management in the R* distributed database management system," ACM Trans. Database Syst., vol. 11, no. 4, pp. 378–396, Dec. 1986.

[17] M. Abdallah, R. Guerraoui, and P. Pucheral, "One-phase commit: does it make sense?" in Proc. of International Conference on Parallel and Distributed Systems. IEEE, 1998, pp. 182–192.

[18] M. K. Aguilera, "Stumbling over consensus research: Misunderstandings and issues," in Replication. Springer, 2010, pp. 59–72.

[19] L. Lamport, "The part-time parliament," ACM Transactions on Computer Systems, vol. 16, no. 2, pp. 133–169, 1998.

[20] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in Proceedings of the 1st SoCC. ACM, 2010.

[21] J. Gray and L. Lamport, "Consensus on transaction commit," ACM Trans. Database Syst., vol. 31, no. 1, pp. 133–160, Mar. 2006.

[22] R. Guerraoui, M. Larrea, and A. Schiper, "Reducing the cost for non-blocking in atomic commitment," in Proc. of ICDCS. IEEE, 1996, pp. 692–697.

[23] R. Guerraoui and A. Schiper, "The decentralized non-blocking atomic commitment protocol," in Proc. of IEEE Symposium on Parallel and Distributed Processing. IEEE, 1995, pp. 2–9.

[24] Y. Sovran, R. Power, M. K. Aguilera, and J. Li, "Transactional storage for geo-replicated systems," in Proc. of SOSP '11, pp. 385–400.

[25] S. Mu, Y. Cui, Y. Zhang, W. Lloyd, and J. Li, "Extracting more concurrency from distributed transactions," in Proc. of OSDI, 2014.

[26] E. P. Jones, D. J. Abadi, and S. Madden, "Low overhead concurrency control for partitioned main memory databases," in Proc. of SIGMOD. ACM, 2010, pp. 603–614.