Too good to be true: when overwhelming evidence fails to convince
Lachlan J. Gunn,1,∗ François Chapeau-Blondeau,2,† Mark D. McDonnell,3,‡
Bruce R. Davis,1,§ Andrew Allison,1,¶ and Derek Abbott1,∗∗
1School of Electrical and Electronic Engineering,
The University of Adelaide 5005, Adelaide, Australia
2Laboratoire Angevin de Recherche en Ingénierie des Systèmes (LARIS),
University of Angers, 62 avenue Notre Dame du Lac, 49000 Angers, France
3School of Information Technology and Mathematical Sciences,
University of South Australia, Mawson Lakes SA, 5095, Australia.††
Is it possible for a large sequence of measurements or observations, which support a hypothesis, to counterintuitively decrease our confidence? Can unanimous support be too good to be true? The assumption of independence is often made in good faith; however, rarely is consideration given to whether a systemic failure has occurred. Taking this into account can cause certainty in a hypothesis to decrease as the evidence for it becomes apparently stronger. We perform a probabilistic Bayesian analysis of this effect with examples based on (i) archaeological evidence, (ii) weighing of legal evidence, and (iii) cryptographic primality testing. We find that even with surprisingly low systemic failure rates, high confidence is very difficult to achieve and, in particular, we find that certain analyses of cryptographically-important numerical tests are highly optimistic, underestimating their false-negative rate by as much as a factor of 2^80.
INTRODUCTION

In a number of branches of science, it is now well known that deleterious effects can conspire to produce a benefit or desired positive outcome. A key example where this manifests is in the field of stochastic resonance [1–3], where a small amount of random noise can surprisingly improve system performance, provided some aspect of the system is nonlinear. Another celebrated example is that of Parrondo's Paradox, where individually losing strategies combine to provide a winning outcome [4, 5].

Loosely speaking, a small amount of 'bad' can produce a 'good' outcome. But is the converse possible? Can too much 'good' produce a 'bad' outcome? In other words, can we have too much of a good thing?

The answer is affirmative—when improvements are made that result in a worse overall outcome, the situation is known as Verschlimmbesserung [6] or disimprovement. Whilst this converse paradigm is less well known in the literature, a key example is the Braess Paradox, where an attempt to improve traffic flow by adding bypass routes can counterintuitively result in worse traffic congestion [7–9]. Another example is the truel, where three gunmen fight to the death—it turns out that under certain conditions the weakest gunman surprisingly reduces his chances of survival by firing a shot at either of his opponents [10]. These phenomena can be broadly considered to fall under the class of anti-Parrondo effects [11, 12], where the inclusion of winning strategies fails.

In this paper, for the first time, we perform a Bayesian mathematical analysis to explore the question of multiple confirmatory measurements or observations, showing when they can—surprisingly—disimprove confidence in the final outcome. We choose the striking example that increasing confirmatory identifications in a police line-up or identity parade can, under certain conditions, reduce our confidence that a perpetrator has been correctly identified.

Imagine that as a court case drags on, witness after witness is called. Let us suppose thirteen witnesses have testified to having seen the defendant commit the crime. Witnesses may be notoriously unreliable, but the sheer magnitude of the testimony is apparently overwhelming. Anyone can make a misidentification, but intuition tells us that, with each additional witness in agreement, the chance of them all being incorrect will approach zero. Thus one might naïvely believe that the weight of as many as thirteen unanimous confirmations leaves us beyond reasonable doubt.

However, this is not necessarily the case, and more confirmations can surprisingly disimprove our confidence that the defendant has been correctly identified as the perpetrator. This type of possibility was recognised intuitively in ancient times. Under ancient Jewish law [13], one could not be unanimously convicted of a capital crime—it was held that the absence of even one dissenting opinion among the judges indicated that there must remain some form of undiscovered exculpatory evidence.

Such approaches are greatly at odds with standard practice in engineering, where measurements are often taken to be independent. When this is so, each new measurement tends to lend support to the outcome with which it most concords. An important question, then, is to distinguish between the two types of decision problem: those where additional measurements truly lend support, and those for which increasingly consistent evidence
either fails to add or actively reduces confidence. Otherwise, it is only later, when the results come under scrutiny, that unexpectedly good results are questioned; Mendel's plant-breeding experiments provide a good example of this [14, 15], his results matching their predicted values sufficiently well that their authenticity has been mired in controversy since the early 20th century.

The key ingredient is the presence of a hidden failure state that changes the measurement response. This change may be a priori quite rare—in the applications that we shall discuss, it ranges from 10^-1 to 10^-19—but when several observations are aggregated, the a posteriori probability of the failure state can increase substantially, and even come to dominate the a posteriori estimate of the measurement response. We shall show that including error rates changes the information-fusion rule in a measurement-dependent way. Simple linear superposition no longer holds, resulting in non-monotonicity that leads to these counterintuitive effects.

This paper is constructed as follows. First, we introduce an example of a hypothetical archaeological find: a clay pot from the Roman era. We consider multiple confirmatory measurements that decide whether the pot was made in Britain or Italy. Via a Bayesian analysis, we then show that due to failure states, our confidence in the pot's origin does not improve for large numbers of confirmatory measurements. We begin with this example of the pot due to its simplicity, and because it captures the essential features of the problem in a clear manner.

Second, we build on this initial analysis and extend it to the problem of the police identity parade, showing that our confidence that a perpetrator has been identified surprisingly declines as the number of unanimous witnesses becomes large. We use this mathematical framework to revisit a specific point of ancient Jewish law—we show that it does indeed have a sound basis, even though it grossly challenges our naïve expectation.

Third, we finish with a final example to show that our analysis has broader implications and can be applied to electronic systems of interest to engineers. We chose the example of a cryptographic system, and show that a surprisingly small bit error rate can result in a larger-than-expected reduction in security.

Our analyses, ranging from cryptography to criminology, provide examples of how rare failure modes can have a counterintuitive effect on the achievable level of confidence.

A HYPOTHETICAL ROMAN POT

Let us begin with a simple scenario: the identification of the origin of a clay pot that has been dug from British soil. Its design identifies it as being from the Roman era, and all that remains is to determine whether it was made in Roman-occupied Britain or whether it was brought from Italy by travelling merchants. Suppose that we are fortunate and that a test is available to distinguish between the clay from the two regions; clay from one area—let us suppose that it is Britain—contains a trace element which can be detected by laboratory tests with an error rate p_e = 0.3. This is clearly excessive, and so we run the test several times. After k tests have been made on the pot, the number of errors will be binomially distributed, E ∼ Bin(k, p_e). If the two origins, Britain and Italy, are a priori equally likely, then the most probable origin is the one suggested by the greatest number of samples.

Now imagine that several manufacturers of pottery deliberately introduced large quantities of this element during their production process, and that therefore it will be detected with 90% probability in their pots, which make up p_c = 1% of those found; of these, half are of British origin. We call p_c the contamination rate. This is the hidden failure state to which we alluded in the introduction. Then, after the pot tests positive several times, we will become increasingly certain that it was manufactured in Britain. However, as more and more test results are returned from the laboratory, all positive, it will become more and more likely that the pot was manufactured with this unusual process, eventually causing the probability of British origin, given the evidence, to fall to 50%. This is the essential paradox of the system with hidden failure states—overwhelming evidence can itself be evidence of uncertainty, and thus be less convincing than more ambiguous data.

Formal model

Let us now proceed to formalise the problem above. Suppose we have two hypotheses, H_0 and H_1, and a series of measurements X = (X_1, X_2, ..., X_n). We define a variable F ∈ ℕ that determines the underlying measurement distribution, p_{X|F,H_i}(x). We may then use Bayes' law to find

    P[H_i | X] = P[X | H_i] P[H_i] / P[X],    (1)

which can be expanded by conditioning with respect to F, yielding

    P[H_i | X] = ( Σ_f P[X | H_i, F=f] P[H_i, F=f] ) / ( Σ_{f,H_k} P[X | H_k, F=f] P[H_k, F=f] ).    (2)

In our examples there are a number of simplifying conditions—there are only two hypotheses and two measurement distributions, reducing Eqn. 2 to

    P[H_i | X] = ( 1 + [ Σ_{f=0..1} P[X | H_{1−i}, F=f] P[H_{1−i}, F=f] ] / [ Σ_{f=0..1} P[X | H_i, F=f] P[H_i, F=f] ] )^(−1).    (3)
Computation of these a posteriori probabilities thus requires knowledge of two distributions: the measurement distributions P[X | H_k, F], and the state probabilities P[H_i, F]. Having tabulated these, we may substitute them into Eqn. 3, yielding the a posteriori probability for each hypothesis. In this paper the measurement distributions P[X | H_i, F=f] are all binomial; however, this is not the case in general.

Analysis of the pot origin distribution

In the case of the pot, the hypotheses and measurement distributions—the origin and contamination, respectively—are shown in Table I.

    P[F, H_i]:
                             Origin: Italy (H0)     Britain (H1)
      Contaminated (F=0)     p_c/2 = 0.005          p_c/2 = 0.005
      Uncontaminated (F=1)   (1−p_c)/2 = 0.495      (1−p_c)/2 = 0.495

    P[Positive result | F, H_i]:
                             Origin: Italy (H0)     Britain (H1)
      Contaminated (F=0)     0.9                    0.9
      Uncontaminated (F=1)   p_e = 0.3              1−p_e = 0.7

TABLE I. The model parameters for the case of the pot, for use in Eqn. 3 with a contamination rate p_c = 10^-2. The a priori distribution of the origin is identically 50% for both Britain and Italy, whether or not the pot's manufacturing process has contaminated the results. As a result, the two columns of P[F, H_i] are identical. The columns of the measurement distribution, shown below it, differ from one another, thereby giving the test discriminatory power. When the pot has been contaminated, the probability of a positive result is identical for both origins, rendering the test ineffective.

Each measurement is Bernoulli-distributed, and the number of positive results is therefore described by a binomial distribution, with the probability mass function

    P[X = x] = C(n, x) p^x (1−p)^(n−x)

after n trials, the probability p being taken from the measurement-distribution section of Table I.

Substituting these probability masses into Eqn. 2, we see in Figure 1 that as more and more tests return positive results, we become increasingly certain of the pot's British heritage, but an unreasonably large number of positive results will indicate contamination and so yield a reduced level of certainty.

[Figure 1: probability of British origin versus the number of unanimously-positive tests, with curves for contamination rates p_c = 0, 10^-3, 10^-2, and 10^-1.]

FIG. 1. Probability that the pot is of British origin given n tests, all coming back positive, for a variety of contamination rates p_c and a 30% error rate. In the case of the pot above, with p_c = 10^-2, we see a peak at n = 5, after which the level of certainty falls back to 0.5 as it becomes more likely that the pot originates at a contaminating factory. When p_c = 0, this is the standard Bayesian analysis where failure states are not considered. We see therefore that even small contamination rates can have a large effect on the global behaviour of the testing methodology.
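The curve in Figure 1 is straightforward to reproduce numerically. The following C sketch (our own illustration, not part of the paper; the helper name p_british is ours) evaluates Eqn. 3 with the Table I parameters for n unanimously-positive tests; compiled with cc -lm, it shows the posterior peaking near n = 5 for p_c = 10^-2 before decaying towards 0.5.

#include <stdio.h>
#include <math.h>

/* Posterior probability of British origin (H1) after n unanimously-
 * positive tests, per Eqn. 3 and Table I: contamination rate pc,
 * per-test error rate pe, contaminated pots test positive 90% of
 * the time regardless of origin, prior on origin is 50/50. */
double p_british(int n, double pc, double pe)
{
    double contaminated = 0.5 * pc * pow(0.9, n);          /* F=0, either origin */
    double britain = 0.5 * (1.0 - pc) * pow(1.0 - pe, n);  /* F=1, H1 */
    double italy   = 0.5 * (1.0 - pc) * pow(pe, n);        /* F=1, H0 */

    return (contaminated + britain)
         / (2.0 * contaminated + britain + italy);
}

int main(void)
{
    /* With pc = 1e-2 the posterior peaks near n = 5 and then falls
     * back towards 0.5, matching Figure 1. */
    for (int n = 0; n <= 30; n++)
        printf("n = %2d: P[Britain] = %.4f\n", n, p_british(n, 1e-2, 0.3));
    return 0;
}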
It is worth taking a moment, however, to briefly discuss the effects of weakening certain conditions; in particular, we consider two cases: that where the rate of contamination depends upon the origin of the pot, and that where the results after contamination are also origin-dependent.

Where the rate of contamination depends upon the origin, evidence of contamination provides some small evidence of where the pot came from. Thus if 80% of contaminated pots are of British origin, then Figure 1 will eventually converge to 0.8 rather than 0.5.

If the probability of a positive test is dependent upon the origin even when contaminated, then the behaviour of the test protocol changes qualitatively. Supposing that in the contaminated case the test is substantially less accurate, the probability of British origin will drop towards 0.5, owing to the increased likelihood of contamination; eventually, however, it will rise again towards 1.0 as sufficient data becomes available to make use of the less probative experiment that we now know to be taking place.

THE RELIABILITY OF IDENTITY PARADES

We initially described the scenario of a court case, in which witness after witness testifies to having seen the defendant commit the crime of which he is accused. But in-court identifications are considered unreliable, and in reality, if identity is in dispute, then the identification is made early in the investigation under controlled conditions [16]. At some point, whether before or after being charged, the suspect has most likely been shown to each witness amongst a number of others, known as fillers, who are not under suspicion. Each witness is asked to identify the true perpetrator, if present, amongst the group.

This process, known as an identity parade or line-up, is an experiment intended to determine whether the suspect is in fact the same person as the perpetrator. It may be performed only once, or repeated many times with many witnesses. As human memory is inherently uncertain, the process will include random error; if the experiment is not properly carried out, then there may also be systematic error, and this is the problem that concerns us in this paper.

Having seen how a unanimity of evidence can create uncertainty in the case of the unidentified pot, we now apply the same analysis to the case of an identity parade. If the perpetrator is not present—that is to say, if the suspect is innocent—then in an unbiased parade the witness should be unable to choose the suspect with a probability greater than chance. Ideally, they would decline to make a selection; however, this does not always occur in practice [16, 17], and forms part of the random error of the procedure. If the parade is biased—whether intentionally or unintentionally—for example because (i) the suspect is somehow conspicuous [18], (ii) the staff running the parade direct the witness towards him, (iii) by chance he happens to resemble the perpetrator more closely than the fillers, or (iv) the witness holds a bias, for example because they have previously seen the suspect [16], then an innocent suspect may be selected with a probability greater than chance. This is the hidden failure state that underlies this example; we assume in our analysis that this is completely binary—either the parade is completely unbiased or it is highly biased against the suspect.

In recent decades, a number of experiments [17, 19] have been carried out in order to establish the reliability of this process. Test subjects are shown the commission of a simulated crime, whether in person or on video, and asked to locate the perpetrator amongst a number of people. In some cases the perpetrator will be present, and in others not. The former allows estimation of the false-negative rate of the process—the rate at which the witness fails to identify the perpetrator when present—and the latter the false-positive rate—the rate at which an innocent suspect will be mistakenly identified. Let us denote by p_fn the false-negative rate; this is equal to the proportion of subjects who failed to correctly identify the perpetrator when he was present, and was found in [17] to be 48%.

Estimating the false-positive rate is complicated by the fact that only one suspect is present in the line-up—when the suspect is innocent, an eyewitness who incorrectly identifies a filler as being the perpetrator has correctly rejected the innocent suspect as being the perpetrator, despite their error. For the purposes of our analysis, we assume that the witness selects at random in this case, and therefore divide the 80% perpetrator-absent selection rate of [17] by the number of participants L = 6, yielding a false-positive rate of p_fp = 0.133.

Let us now suppose that there is a small probability p_c that the line-up is conducted incorrectly—for example, volunteers have been chosen who fail to adequately match the description of the perpetrator—leading to identification of the suspect 90% of the time, irrespective of his guilt. For the sake of analysis we assume that if this occurs, it will occur for all witnesses, though in practice the police might perform the procedure correctly for some witnesses and not others. The probability of the suspect being identified for each of the cases is shown in Table II.

    P[F, H_i]:
                              Suspect is: Innocent (H0)   Guilty (H1)
      Biased parade (F=0)     p_c/2 = 0.005               p_c/2 = 0.005
      Unbiased parade (F=1)   (1−p_c)/2 = 0.495           (1−p_c)/2 = 0.495

    P[Identification | F, H_i]:
                              Suspect is: Innocent (H0)   Guilty (H1)
      Biased parade (F=0)     0.9                         0.9
      Unbiased parade (F=1)   p_fp = 0.13                 1−p_fn = 0.52

TABLE II. The model parameters for the hypothetical identity parade. In a similar fashion to the first example, we assume a priori a 50% probability of guilt. In this case, the measurement distributions are substantially asymmetric with respect to innocence and guilt, unlike Table I.

If we assume a 50% prior probability of guilt, and independent witnesses, the problem is now identical to that of identifying the pot. The probability of guilt, given the unanimous parade results, is shown in Figure 2 as a function of the number of unanimous witnesses.
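For readers who wish to experiment with these numbers, here is a small C sketch (again our own illustration; the helper name p_guilty is ours) implementing the Table II model. With p_c = 10^-4 it confirms the observation made below that ten unanimous identifications yield a lower posterior probability of guilt than three.

#include <stdio.h>
#include <math.h>

/* Posterior P[guilty] after n unanimous identifications (Table II).
 * pfp, pfn: false-positive/-negative rates of an unbiased parade;
 * a biased parade (prior probability pc) identifies the suspect
 * with probability 0.9 regardless of guilt. Prior P[guilty] = 0.5. */
double p_guilty(int n, double pc, double pfp, double pfn)
{
    double biased   = 0.5 * pc * pow(0.9, n);             /* per hypothesis */
    double guilty   = 0.5 * (1.0 - pc) * pow(1.0 - pfn, n);
    double innocent = 0.5 * (1.0 - pc) * pow(pfp, n);

    return (biased + guilty) / (2.0 * biased + guilty + innocent);
}

int main(void)
{
    /* With pc = 1e-4, ten unanimous witnesses are less convincing
     * than three. */
    printf("n = 3:  %.4f\n", p_guilty(3,  1e-4, 0.13, 0.48));
    printf("n = 10: %.4f\n", p_guilty(10, 1e-4, 0.13, 0.48));
    return 0;
}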
[Figure 2: probability of guilt versus the number of unanimously-positive identifications, with curves for p_c = 0, 10^-4, 10^-3, and 10^-2.]

FIG. 2. Probability of guilt given varying numbers of unanimous line-up identifications, assuming a 50% prior probability of guilt and identification accuracies given by [17]. Of note is that, for the case plotted here where the witnesses are unanimous, with a failure rate p_c = 0.01 it is impossible to reach 95% certainty in the guilt of the suspect, no matter how many witnesses have been found.

We see that after a certain number of unanimously positive identifications, the probability of guilt diminishes. Even with only one in ten thousand line-ups exhibiting this bias towards the suspect, the peak probability of guilt is reached with only five unanimous witnesses, completely counter to intuition—in fact, with this rate of failure, ten identifications in agreement provide less evidence of guilt than three. We see also that even with a 50% prior probability of guilt, a 1% failure rate renders it impossible to achieve 95% certainty if the witnesses are unanimous.

This tendency to be biased towards a particular member of the line-up when an error occurs was noted [16, paragraph 4.31] prior to the more rigorous research stimulated by the advent of DNA testing, leading us to suspect that our sub-1% contamination rates are probably overly optimistic.

ANCIENT JUDICIAL PROCEDURE

The acknowledgement of this type of phenomenon is not entirely new; indeed, the adage "too good to be true" dates to the sixteenth century [20, good, P5.b]. Moreover, its influence on judicial procedure was visible in Jewish law even in the classical era; until the Romans ultimately removed the right of the Sanhedrin to confer death sentences, a defendant unanimously condemned by the judges would be acquitted [13, Sanhedrin 18b], the Talmud stating "If the Sanhedrin unanimously find guilty, he is acquitted. Why? — Because we have learned by tradition that sentence must be postponed till the morrow in hope of finding new points in favour of the defence".

The value of this rule becomes apparent when we consider that the Sanhedrin was composed, for ordinary capital offences, of 23 members [13, Sanhedrin 2a]. In our line-up model, this many unanimous witnesses would indicate a probability of guilt scarcely better than chance, suggesting that the inclusion of this rule should have a substantial effect.

We show the model parameters for the Sanhedrin decision in Table III, which we use to compute the probability of guilt in Figure 3 for various numbers of judges condemning the defendant. We see that the probability of guilt falls as the judges approach unanimity; however, excluding unanimous decisions substantially reduces the probability of false conviction.

    P[F, H_i]:
                                   Suspect is: Innocent (H0)   Guilty (H1)
      Biased proceedings (F=0)     p_c/2 = 0.005               p_c/2 = 0.005
      Unbiased proceedings (F=1)   (1−p_c)/2 = 0.495           (1−p_c)/2 = 0.495

    P[Identification | F, H_i]:
                                   Suspect is: Innocent (H0)   Guilty (H1)
      Biased proceedings (F=0)     0.95                        0.95
      Unbiased proceedings (F=1)   p_fp = 0.14                 1−p_fn = 0.75

TABLE III. The model parameters for the Sanhedrin trial. Again, we assume an a priori 50% probability of guilt. However, the measurement distributions are the results of [21, model (2)] for juries; in contrast to the case of the identity parade, the false-negative rate is far lower. Despite the trial being conducted by judges, we choose to use the jury results, as the judges' tendency towards conviction is not reflected in the highly risk-averse rabbinic legal tradition.

It is worth stressing that the exact shapes of the curves in Figure 3 are unlikely to be entirely correct; communication between the judges will prevent their verdicts from being entirely independent, and false-positive and false-negative rates will be very much dependent upon the evidentiary standard required to bring charges, the strength of the contamination when it does occur, and the accepted burden of proof of the day. However, it is nonetheless of qualitative interest that, with reasonable parameters, this ancient law can be shown to have a sound statistical basis.
[Figure 3: probability of guilt versus the number of condemning judges out of 23, with curves for p_c = 10^-2, 10^-3, and 10^-4; the region of necessary conditions for conviction is shaded.]

FIG. 3. Probability of guilt as a function of judges in agreement out of 23—the number used by the Sanhedrin for most capital crimes—for various contamination rates p_c. We assume as before that half of defendants are guilty, and use the estimated false-positive and false-negative rates of juries from [21, model (2)], 0.14 and 0.25 respectively. We arbitrarily assume that a 'contaminated' trial will result in a positive vote 95% of the time. The panel of judges numbers 23, with conviction requiring a majority of two and at least one dissenting opinion [13, Sanhedrin]; the majority of two means that the agreement of at least 13 judges is required in order to cast a sentence of death, to a maximum of 22 votes in order to satisfy the requirement of a dissenting opinion. These necessary conditions for a conviction by the Sanhedrin are shown as the pink region in the graph.

THE RELIABILITY OF CRYPTOGRAPHIC SYSTEMS

We now consider a different example, drawn from cryptography. An important operation in many protocols is the generation and verification of prime numbers; the security of some protocols depends upon the primality of a number that may be chosen by an adversary. In this case, one may test whether it is prime, whether by brute force or by using another test such as the Rabin-Miller test [22, p. 176]. As the latter is probabilistic, we repeat it until we have achieved the desired level of security—in [22], a probability 2^-128 of accepting a composite as prime is considered acceptable. However, a naïve implementation cannot achieve this level of security, as we will demonstrate.

The reason is that, despite it being proven that each iteration of the Rabin-Miller test will reject a composite number with probability at least 0.75, a real computer may fail at any time. The chance of this occurring is small; however, it turns out that the probability of a stray cosmic ray flipping a bit in the machine code, causing the test to accept composite numbers, is substantially greater than 2^-128.

Code changes caused by memory errors

Data provided by Google [23] suggests that a given memory module has approximately an 8% probability of suffering an error in any given year, independent of capacity. Assuming a 4 GB module, this results in approximately a λ = 10^-19 probability that any given bit will be flipped in any given second (0.08 errors per year, spread over roughly 3.4×10^10 bits and 3.15×10^7 seconds). We will make the assumption that, in the machine code for the primality-testing routine, there exists at least one bit that, if flipped, will cause all composite numbers—or some class of composite numbers known to the adversary—to be accepted as prime. As an example of how this could happen, consider the function shown in Figure 4, which implements a brute-force factoring test.

#include <math.h>

/* Returns 0 if to_test appears prime, 1 if a factor is found. */
int trialdivision(long to_test)
{
    long i;
    long threshold;

    if(to_test % 2 == 0)
    {
        return 1;
    }

    threshold = (long)sqrt(to_test);
    for(i = 3; i <= threshold; i += 2)
    {
        if(to_test % i == 0)
        {
            return 1;
        }
    }

    return 0;
}

FIG. 4. A function that tests for primality by attempting to factorise its input by brute force.

Assuming that the input is odd, the function will reach one of two return statements, returning zero or one. The C compiler GCC compiles these two return statements to

    45 0053 B8010000 00    movl $1, %eax
    46 0058 EB14           jmp  .L3

and

    56 0069 B8000000 00    movl $0, %eax

respectively. That is to say, it stores the return value as an immediate into the EAX register and then jumps to the cleanup section of the function, labelled .L3. The store instructions on lines 45 and 56 have machine-code values B801000000 and B800000000 for return values of one and zero respectively. These differ by only one bit, and therefore can be transformed into one another by a single bit-error. If the first instruction is turned into the second, this will cause the function to return zero for any odd input, thus always indicating that the input is prime.
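This one-bit distance is easy to verify mechanically. The snippet below (our own check, not part of the paper) XORs the two five-byte immediate-store encodings quoted above and counts the differing bits, printing 1.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* movl $1, %eax and movl $0, %eax: B8 01 00 00 00 vs B8 00 00 00 00. */
    uint64_t ret1 = 0xB801000000ULL;
    uint64_t ret0 = 0xB800000000ULL;
    uint64_t diff = ret1 ^ ret0;

    int bits = 0;
    while (diff) {            /* count the bits in which they differ */
        bits += diff & 1;
        diff >>= 1;
    }
    printf("encodings differ in %d bit(s)\n", bits);
    return 0;
}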
    P[F, H_i]:
                               Number is: Prime (H0)   Composite (H1)
      Always positive (F=0)    0.001 p_c                0.999 p_c
      Normal operation (F=1)   ≈0.001                   ≈0.999

    P[Acceptance | F, H_i]:
                               Number is: Prime (H0)   Composite (H1)
      Always positive (F=0)    1.0                      1.0
      Normal operation (F=1)   1.0                      0.25

TABLE IV. Model parameters for the Rabin-Miller test on random 2000-bit numbers. However, we have no choice but to assume the lower bound on the composite-number rejection rate, and so this model is inappropriate. Furthermore, in an adversarial setting the attacker may intentionally choose a difficult-to-detect composite number, rendering the prior distribution optimistic.

The effect of memory errors on confidence

At cryptographically-interesting sizes—on the order of 2^2000—roughly one in a thousand numbers is prime [22, p. 173]. We might calculate the model parameters as before—for interest's sake, we have done so in Table IV—and calculate the confidence in a number's primality after a given number of tests. However, this is not particularly useful, for two reasons: first, the rejection probability of 75% is a lower bound, and for randomly chosen numbers is a substantial underestimate; second, we do not always choose numbers at random, but rather may need to test those provided by an adversary. In this case, we must assume that they have tried to deceive us by providing a composite number, and would instead like to know the probability that they will be successful. The Bayesian estimator in this case would provide only a tautology of the type: 'given the data and the fact that the number is composite, the number is composite'.

Let us suppose that the machine containing the code is rebooted every month, and the Rabin-Miller code remains in memory for the duration of this period; then, neglecting other potential errors that could affect the test, at the time of the reboot the probability that the bit has flipped is now p_f = 2.6×10^-13; this event we denote A_F. Let k be the number of iterations performed; the probability of accepting a composite number is at most 4^-k, and we assume that the adversary has chosen a composite number such that this is the true probability of acceptance. We denote by A_R the event that the number is accepted by the correctly-operating algorithm.

When hardware errors are taken into account, the probability of accepting a composite number is no longer 4^-k, but

    p_fa = P[A_F ∪ A_R]                        (4)
         = P[A_F] + P[A_R] − P[A_F, A_R].      (5)

Since A_F and A_R are independent,

    p_fa = P[A_F] + P[A_R] − P[A_F] P[A_R]     (6)
         = 4^-k (1 − p_f) + p_f                (7)
         ≥ p_f.                                (8)

No matter how many iterations k of the algorithm are performed, this is substantially greater than the 2^-128 security level that is predicted by probabilistic analysis of the algorithm alone, thus demonstrating that algorithmic analyses that do not take into account the reliability of the underlying hardware can be highly optimistic. The false-acceptance rate as a function of the number of test iterations and time in memory is shown in Figure 5.

[Figure 5: false-acceptance probability (from 2^-32 down to 2^-128) versus the number of Rabin-Miller iterations, with curves for code resident in memory for 1 month, 1 day, 1 second, 1 nanosecond, and with no faults.]

FIG. 5. The acceptance rate as a function of time in memory and the number of Rabin-Miller iterations under the single-error fault model described in this paper. An acceptance rate of 2^-128 is normally chosen; however, without error correction this cannot be achieved. The false-acceptance rate after k iterations is given by p_fa[k] = 4^-k (1 − p_f) + p_f, where p_f is the probability that a fault has occurred that causes a false acceptance 100% of the time. We estimate p_f to be equal to 10^-19 T, where T is the length of time in seconds that the code has been in memory.

A real cryptographic system will include many such checks in order to make sure that an attacker has not chosen weak values for various parameters, and a failure of any of these may result in the system being broken, so our calculations are somewhat optimistic.
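The floor at p_f is easy to see numerically. The following C sketch (ours, not from the paper) evaluates p_fa[k] for code resident in memory for one month; beyond roughly k = 21 iterations the false-acceptance rate stops improving at about 2^-42, far short of 2^-128.

#include <stdio.h>
#include <math.h>

/* False-acceptance probability of k Rabin-Miller iterations when a
 * single bit-flip (probability pf) makes every composite pass:
 * pfa[k] = 4^-k (1 - pf) + pf, as in the caption of Figure 5. */
double pfa(int k, double pf)
{
    return pow(4.0, -k) * (1.0 - pf) + pf;
}

int main(void)
{
    const double lambda = 1e-19;     /* bit-flips per bit per second */
    const double month  = 2.6e6;     /* seconds in roughly one month */
    double pf = lambda * month;      /* ~2.6e-13, as in the text     */

    /* Print the rate as a power of two; it floors near 2^-42. */
    for (int k = 8; k <= 64; k += 8)
        printf("k = %2d: 2^%.1f\n", k, log2(pfa(k, pf)));
    return 0;
}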
Error-correcting-code-equipped (ECC) memory will substantially reduce the risk of this type of fault, and for regularly-accessed regions of code—accessed multiple times per second—the rate will approach the 2^-128 level. A single parity bit, as used in at least some CPU level-one instruction caches [24], requires two bit-flips to induce an undetected error. Suppose the parity is checked every R seconds; then the probability of an undetected bit-flip in any given second is

    λ' = (λR)² / R = λ²R.    (9)

For code that is accessed even moderately often, this will come much closer to 2^-128. For example, if R = 100 ms then this results in a false-acceptance rate of 2^-108 after one month (λ²R = 10^-39 per second, which over the roughly 2.6×10^6 seconds of a month accumulates to about 2.6×10^-33 ≈ 2^-108), much closer to the 2^-128 level of security promised by analysis of the algorithm alone. The stronger error-correction codes used by the higher-level caches and main memory will detect virtually all such errors—with two-bit detection capability, the rate of undetected bit-flips will be at most

    λ' = λ³R²,    (10)

and even with a check rate of only once per 100 ms, the rate of memory errors is essentially zero, increasing the false-acceptance rate by a factor of only 10^-14 above the 2^-128 level that would be achieved in a perfect operating environment.

DISCUSSION

This phenomenon is interesting in that it is commonly known and applied heuristically, and trivial examples such as the estimation of coin bias [25, section 2.1] have been well analysed—see the appendix for a brief discussion—but these rare failure states are rarely, if ever, considered when a statistical approach to decision-making is applied to an entire system. Real systems that attempt to counter failure modes producing consistent data tend to focus upon the detection of particular failures rather than the mere fact of consistency. Sometimes there is little choice—a casino that consistently ejected gamblers on a winning streak would soon find itself without a clientele—however, we have demonstrated that in many cases the level of consistency needed to cast doubt on the validity of the data is surprisingly low.

If this is so, then we must reconsider the use of thresholding as a decision mechanism when there is the potential for such failure modes to exist, particularly when the consequences of an incorrect decision are large. When the decision rule takes the form of a probability threshold, it is necessary to deduce an upper threshold as well, such as was shown in Figure 3, in order to avoid capturing the region indicative of a systemic failure.

That this phenomenon was accounted for in ancient Jewish legal practice indicates a surprising level of intuitive statistical sophistication in this ancient law code; though predating by millennia the statistical tools needed to perform a rigorous analysis, our simple model of the judicial panel indicates that the requirement of a dissenting opinion would have provided a substantial increase in the probability of guilt required to secure a conviction.

Applied to cryptographic systems, we see that even the minuscule probability that one particular bit in the system's machine code will be flipped due to a memory error over the course of a month, rendering the system insecure, is approximately 2^80 times larger than the risk predicted by algorithmic analysis. This demonstrates the importance of strong error correction in modern cryptographic systems that strive for a failure rate on the order of 2^-128, a level of certainty that appears to be otherwise unachievable without active mitigation of the effect.

The use of naturally-occurring memory errors for DNS hijacking [26] has previously been demonstrated, as has the ability of a user to disturb protected addresses by writing to adjacent cells [27]; however, little consideration has been given to the possibility that this type of fault might occur simply by chance, implying that security analyses which assume reliable hardware are substantially flawed when applied to consumer systems lacking error-corrected memory.

We have considered only a relatively simple case, in which there are only two levels of contamination. However, in practical situations we might expect any of a wide range of failure modes varying continuously. We have described a simple case in the appendix, where a coin may be biased—or not—towards either heads or tails with any strength; were one to apply this to the case of an identity parade, for example, one would find a probability that the suspect is indeed the perpetrator, as before, but taking into account that there may well be slight biases that nudge the witnesses towards or away from the suspect, not merely catastrophic ones. The result is heavily dependent upon the distribution of the bias, and lacking sufficient data to produce such a model, we have chosen to eschew the complexity of the continuous approach and focus on a simple two-level model. An example of this approach is shown in [28, p. 117].

A related concept to that which we have discussed is the Duhem-Quine hypothesis [28, p. 6]; this is the idea that an experiment inherently tests hypotheses as a group—not merely the phenomenon that we wish to examine, but also, for example, the correct function of the experimental apparatus, and that only the desired independent variables are being changed. Our thesis is a related one, namely that in practical systems the failure of these auxiliary hypotheses, though unlikely, results in a significant reduction in confidence when it occurs, an effect which has traditionally been ignored.

CONCLUSION

We have analysed the behaviour of systems that are subject to systematic failure, and demonstrated that with
relatively low failure rates, large sample sizes are not required in order that unanimous results start to become indicative of systematic failure. We have investigated the effect of this phenomenon upon identity parades, and shown that even with only a 1% rate of failure, confidence begins to decrease after only three unanimous identifications, failing to reach even 95%.

We have also applied our analysis of the phenomenon to cryptographic systems, investigating the effect by which confidence in the security of a parameter fails to increase with further testing due to potential failures of the underlying hardware. Even with a minuscule failure rate of 10^-13 per month, this effect dominates the analysis and is thus a significant determining factor in the overall level of security, increasing the probability that a maliciously-chosen parameter will be accepted by a factor of more than 2^80.

Hidden failure states such as these reduce confidence far more than intuition leads one to believe, and must be more carefully considered than is the case today if the lofty targets that we set for ourselves are to be achieved in practice.

COMPETING INTERESTS

We have no competing interests.

AUTHOR CONTRIBUTIONS

LJG drafted the manuscript. LJG and DA devised the concept. LJG, FC-B, MDM, BRD, AA, and DA carried out analyses and checking. All authors contributed to proofing the manuscript. All authors gave final approval for publication.

FUNDING

Lachlan J. Gunn is a Visiting Scholar at the University of Angers, France, supported by an Endeavour Research Fellowship from the Australian Government. Mark D. McDonnell is supported by an Australian Research Fellowship (DP1093425) and Derek Abbott is supported by a Future Fellowship (FT120100351), both from the Australian Research Council (ARC).

Analysis of a biased coin

It is worth adding a brief discussion of a simple and well-known problem that has some relation to what we have discussed, namely the question of whether or not a coin is biased. We follow the Bayesian approach given in [25]. They use Bayes' law in its proportional form,

    P[Q | {data}] ∝ P[{data} | Q] P[Q],    (11)

where Q is the probability that a coin-toss will yield heads. Various prior distributions P[Q] can be chosen, a matter that we will discuss momentarily.

As the coin tosses are independent, the data can be boiled down to a binomial random variable X ∼ Bin(n, Q), where n is the number of coin tosses made. Substituting the binomial probability mass function into Eqn. 11, they find that

    P[Q | X] ∝ Q^X (1 − Q)^(n−X) P[Q].    (12)

As the number of samples n increases, this becomes increasingly peaked around the value Q = X/n, this 'peaking' effect limited by the shape of P[Q]. As the number of samples increases, the Q^X (1 − Q)^(n−X) part of the expression eventually comes to dominate the shape of the posterior distribution P[Q | X], and we have no choice but to believe that the coin genuinely does have a bias close to X/n.

In the examples previously discussed, we have assumed that bias is very unlikely; in the coin example, this corresponds to a prior distribution P[Q] that is strongly clustered around Q = 0.5; in this case, a very large number of samples will be necessary in order to conclusively reject the hypothesis that the coin is unbiased or nearly so. However, eventually this will occur, and the posterior distribution will change; when this occurs, the system has visibly failed—a casino using the coin will decide that they are not in fact playing the game that they had planned, and must cease before their loss becomes catastrophic. This is much like the case of the Sanhedrin—if too many judges agree, the system has failed, and should not be considered reliable.
∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]
¶ [email protected]
∗∗ [email protected]
†† School of Electrical and Electronic Engineering, The University of Adelaide 5005, Adelaide, Australia

[1] R. Benzi, A. Sutera, and A. Vulpiani, "The mechanism of stochastic resonance," Journal of Physics A: Mathematical and General, vol. 14, no. 11, pp. L453–L457, 1981.
[2] M. D. McDonnell, N. G. Stocks, C. E. M. Pearce, and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization. Cambridge University Press, 2008.
[3] M. D. McDonnell and D. Abbott, "What is stochastic resonance? Definitions, misconceptions, debates, and its relevance to biology," PLoS Computational Biology, 2009, art. e1000348.
[4] G. P. Harmer and D. Abbott, "Game theory: losing strategies can win by Parrondo's paradox," Nature, vol. 402, p. 864, 1999.
[5] D. Abbott, "Asymmetry and disorder: a decade of Parrondo's paradox," Fluctuation and Noise Letters, vol. 9, no. 1, pp. 129–156, 2010.
[6] M. N. Mead, "Columbia program digs deeper into arsenic dilemma," Environ. Health Perspect., vol. 113, no. 6, pp. A374–A377, 2005.
[7] D. Braess, "Über ein Paradoxon aus der Verkehrsplanung," Unternehmensforschung, vol. 12, pp. 258–268, 1969.
[8] H. Kameda, E. Altman, T. Kozawa, and Y. Hosokawa, "Braess-like paradoxes in distributed computer systems," IEEE Transactions on Automatic Control, vol. 45, no. 9, pp. 1687–1691, 2000.
[9] Y. A. Korilis, A. A. Lazar, and A. Orda, "Avoiding the Braess paradox in noncooperative networks," Journal of Applied Probability, vol. 36, no. 1, pp. 211–222, 1999.
[10] A. P. Flitney and D. Abbott, "Quantum two- and three-person duels," J. Opt. B: Quantum Semiclass. Opt., vol. 6, no. 8, pp. S860–S866, 2004.
[11] G. P. Harmer, D. Abbott, P. G. Taylor, and J. M. R. Parrondo, "Brownian ratchets and Parrondo's games," Chaos, vol. 11, no. 3, pp. 705–714, 2001.
[12] S. N. Ethier and J. Lee, "Parrondo's paradox via redistribution of wealth," Electron. J. Probab., vol. 17, 2012, art. 20.
[13] I. Epstein, Ed., The Babylonian Talmud. Soncino Press, London, 1961.
[14] R. A. Fisher, "Has Mendel's work been rediscovered?" Annals of Science, vol. 1, no. 2, pp. 115–137, 1936.
[15] A. Franklin, Ending the Mendel-Fisher Controversy. University of Pittsburgh Press, 2008.
[16] P. Devlin, C. Freeman, J. Hutchinson, and P. Knights, Report to the Secretary of State for the Home Department of the Departmental Committee on Evidence of Identification in Criminal Cases. HMSO, London, 1976.
[17] R. A. Foster, T. M. Libkuman, J. W. Schooler, and E. F. Loftus, "Consequentiality and eyewitness person identification," Applied Cognitive Psychology, vol. 8, no. 2, pp. 107–121, 1994.
[18] M. S. Wogalter and D. B. Marwitz, "Suggestiveness in photospread lineups: similarity induces distinctiveness," Applied Cognitive Psychology, vol. 6, no. 5, pp. 443–452, 1992.
[19] R. S. Malpass and P. G. Devine, "Eyewitness identification: lineup instructions and the absence of the offender," Journal of Applied Psychology, vol. 66, no. 4, pp. 482–489, 1981.
[20] Oxford English Dictionary. Oxford University Press, http://oed.com/, accessed 2015-10-06.
[21] B. D. Spencer, "Estimating the accuracy of jury verdicts," Journal of Empirical Legal Studies, vol. 4, no. 2, pp. 305–329, 2007.
[22] N. Ferguson, B. Schneier, and T. Kohno, Cryptography Engineering: Design Principles and Practical Applications. Wiley, Indianapolis, USA, 2010.
[23] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: A large-scale field study," in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '09, Seattle, WA, USA, 2009, pp. 193–204.
[24] Advanced Micro Devices, "AMD Opteron processor product datasheet," http://support.amd.com/TechDocs/23932.pdf, 2007, accessed 2015-10-12.
[25] D. S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial. Oxford University Press, 2006.
[26] A. Dinaberg, "Bitsquatting: DNS hijacking without exploitation," in Proceedings of BlackHat Security, Las Vegas, USA, 2011.
[27] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors," in Proc. IEEE 41st Annual International Symposium on Computer Architecture, Minneapolis, MN, USA, 2014, pp. 361–372.
[28] L. Bovens and S. Hartmann, Bayesian Epistemology. Oxford, 2004.