A Measure of the Goodness of Fit in Unbinned Likelihood Fits; End of Bayesianism?

Rajendran Raja
Fermilab, Batavia, IL 60510, USA
Maximum likelihood fits to data can be done using binned data (histograms) and unbinned data. With binned data, one gets not only the fitted parameters but also a measure of the goodness of fit. With unbinned data, currently, the fitted parameters are obtained but no measure of goodness of fit is available. This remains, to date, an unsolved problem in statistics. Using Bayes' theorem and likelihood ratios, we provide a method by which both the fitted quantities and a measure of the goodness of fit are obtained for unbinned likelihood fits, as well as errors in the fitted quantities. The quantity conventionally interpreted as a Bayesian prior is seen in this scheme to be a number, not a distribution, that is determined from data.
1. Introduction

As of the Durham conference [1], the problem of obtaining a goodness of fit in unbinned likelihood fits was an unsolved one. In what follows, we will denote by the vector s the theoretical parameters (s for "signal") and by the vector c the experimentally measured quantities or "configurations". For simplicity, we will illustrate the method where both s and c are one dimensional, though either or both can be multi-dimensional in practice. We thus define the theoretical model by the conditional probability density P(c|s). An unbinned maximum likelihood fit to data is then obtained by maximizing the likelihood [2],

L = \prod_{i=1}^{n} P(c_i|s)    (1)

where the likelihood is evaluated at the n observed data points c_i, i = 1,...,n. Such a fit will determine the maximum likelihood value s* of the theoretical parameters, but it will not tell us how good the fit is. The value of the likelihood L at the maximum likelihood point does not furnish a goodness of fit, since the likelihood is not invariant under a change of variable. This can be seen by observing that one can transform the variable set c to a variable set c' such that P(c'|s*) is uniformly distributed between 0 and 1. Such a transformation is known as a hypercube transformation in multi-dimensions. Other datasets will yield different values of likelihood in the variable space c when the likelihood is computed with the original function P(c|s*). However, in the hypercube space, the value of the likelihood is unity regardless of the dataset c'_i, i = 1,...,n; thus the likelihood by itself cannot furnish a goodness of fit, since neither the likelihood nor ratios of likelihoods computed using the same distribution P(c|s*) are invariant under variable transformations. The fundamental reason for this non-invariance is that only a single distribution, namely P(c|s*), is being used to compute the goodness of fit.
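The point can be made concrete with a short Python sketch (an illustration of our own, not code used for this paper's results; an exponential model with a hypothetical fitted lifetime s* = 1 is assumed): transforming any dataset by the cumulative distribution F(c|s*) makes the transformed density uniform on (0,1), so the likelihood is identically unity and carries no information about fit quality.

    import numpy as np

    rng = np.random.default_rng(1)
    s_star = 1.0                         # hypothetical fitted value s*
    c = rng.exponential(s_star, 1000)    # any dataset in the original space

    # Hypercube transform: c' = F(c|s*) = 1 - exp(-c/s*) for the exponential
    c_prime = 1.0 - np.exp(-c / s_star)

    # Under the fitted model, the pdf of c' is uniform on (0,1), so every
    # factor in the likelihood product is 1 and the log-likelihood is 0,
    # however badly the model actually describes the data.
    log_likelihood = np.sum(np.log(np.ones_like(c_prime)))
    print(log_likelihood)                # 0.0 for every possible dataset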
2. Likelihood ratios

In binned likelihood cases, where one is comparing a theoretical distribution P(c|s) with a binned histogram, there are two distributions involved: the theoretical distribution and the data distribution. The pdf of the data is approximated by the bin contents of the histogram normalized to unity. If the data consist of n events, the pdf of the data P^{data}(c) is defined in the frequentist sense as the normalized density distribution in c space as n → ∞. In the binned case, we can bin in finer and finer bins as n → ∞ and obtain a smooth function, which we define as the pdf of the data P^{data}(c). In practice, one is always limited by statistics and the binned function will be an approximation to the true pdf. We can now define a likelihood ratio L_R such that

L_R = \frac{\prod_{i=1}^{n} P(c_i|s)}{\prod_{i=1}^{n} P^{data}(c_i)} \equiv \frac{P(c^n|s)}{P^{data}(c^n)}    (2)

where we have used the notation c^n to denote the event set c_i, i = 1,...,n. Let us now note that L_R is invariant under the variable transformation c → c', since

P(c'|s) = |dc/dc'| P(c|s)    (3)

P^{data}(c') = |dc/dc'| P^{data}(c)    (4)

L_R' = L_R    (5)

and the Jacobian of the transformation |dc/dc'| cancels between the numerator and the denominator of the ratio. This is an extremely important property of the likelihood ratio L_R that qualifies it to be a goodness of fit variable. Since the denominator P^{data}(c^n) is independent of the theoretical parameters s, both the likelihood ratio and the likelihood maximize at the same point s*. One can also show [3] that the maximum value of the likelihood ratio occurs when the theoretical likelihood P(c_i|s) and the data likelihood P^{data}(c_i) are equal for all c_i.
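The invariance of Eqs. 3-5 can be checked numerically with the following sketch (toy densities of our own choosing stand in for P(c|s) and P^{data}(c)): under the transformation c → c' = c² the same Jacobian multiplies both pdfs and cancels in the ratio.

    import numpy as np

    rng = np.random.default_rng(2)
    c = rng.exponential(1.0, 500)

    p_theory = lambda x: np.exp(-x / 1.0) / 1.0    # P(c|s) with s = 1.0
    p_data   = lambda x: np.exp(-x / 1.2) / 1.2    # stand-in for P_data(c)

    log_LR = np.sum(np.log(p_theory(c)) - np.log(p_data(c)))

    # Transform c -> c' = c^2; both pdfs pick up the same Jacobian |dc/dc'|
    cp  = c ** 2
    jac = 1.0 / (2.0 * np.sqrt(cp))                # |dc/dc'|
    log_LR_prime = np.sum(np.log(p_theory(np.sqrt(cp)) * jac)
                          - np.log(p_data(np.sqrt(cp)) * jac))

    print(np.allclose(log_LR, log_LR_prime))       # True: L_R is invariant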
3. Binned Goodness of Fit

In the case where the pdf P^{data}(c) is estimated by binned histograms and the statistics are Gaussian, it is readily shown [3] that the commonly used goodness of fit variable χ² = −2 ln L_R. It is worth emphasizing that the likelihood ratio as defined above is needed to derive this result, and not just the negative log of the theoretical likelihood P(c^n|s). The popular conception that χ² is −2 ln P(c^n|s) is simply incorrect! It can also be shown that the likelihood ratio defined above can describe the binned cases where the statistics are Poissonian [4]. In order to solve our problem of goodness of fit in unbinned likelihood cases, one needs to arrive at a method of estimating the data pdf P^{data}(c) without the use of bins.
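A minimal numeric check of the statement χ² = −2 ln L_R (a sketch under assumptions of our own: Gaussian bin likelihoods with the variance held at the predicted bin content, so that the normalization constants cancel in the ratio; the data likelihood is the same Gaussian evaluated at its own bin contents):

    import numpy as np

    rng = np.random.default_rng(3)
    mu  = np.full(30, 100.0)                  # predicted bin contents
    obs = rng.poisson(mu).astype(float)       # observed contents, ~Gaussian

    def log_gauss(x, mean, var):
        return -0.5 * (x - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

    # Numerator: theory likelihood; denominator: data likelihood, i.e. the
    # Gaussian evaluated at its own maximum (observed = expected)
    log_LR = np.sum(log_gauss(obs, mu, mu) - log_gauss(obs, obs, mu))
    chi2   = np.sum((obs - mu) ** 2 / mu)

    print(-2.0 * log_LR, chi2)                # the two numbers agree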
4. Unbinned Goodness of Fit

One of the better known methods of estimating the probability density of a distribution in an unbinned case is the use of Probability Density Estimators (PDEs), also known as Kernel Density Estimators (KDEs) [5]. The pdf P^{data}(c) is approximated by

P^{data}(c) \approx PDE(c) = \frac{1}{n} \sum_{i=1}^{n} G(c - c_i)    (6)

where the kernel function G(c − c_i), centered on each data point c_i, is so defined that it normalizes to unity and for large n approaches a Dirac delta function [3]. The choice of the kernel function can vary depending on the problem. A popular kernel is the Gaussian, defined in the multi-dimensional case as

G(c) = \frac{1}{(\sqrt{2\pi}h)^d \sqrt{\det(E)}} \exp\left(-\frac{H_{\alpha\beta} c^\alpha c^\beta}{2h^2}\right)    (7)

where E is the error matrix of the data, defined as

E_{\alpha\beta} = \langle c^\alpha c^\beta \rangle - \langle c^\alpha \rangle \langle c^\beta \rangle    (8)

where \langle\rangle implies an average over the n events and d is the number of dimensions. The Hessian matrix H is defined as the inverse of E, and repeated indices imply summation. The parameter h is a "smoothing parameter", with a suggested [6] optimal value h ∝ n^{-1/(d+4)}, that satisfies the asymptotic condition

G_\infty(c - c_i) \equiv \lim_{n\to\infty} G(c - c_i) = \delta(c - c_i)    (9)

The parameter h will depend on the local number density and will have to be adjusted as a function of the local density to obtain a good representation of the data by the PDE. Our proposal for the goodness of fit in unbinned likelihood fits is thus the likelihood ratio

L_R = \frac{P(c^n|s)}{P^{data}(c^n)} \approx \frac{P(c^n|s)}{P_{PDE}(c^n)}    (10)

evaluated at the maximum likelihood point s*.
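In one dimension, Eqs. 6 and 10 reduce to the following sketch (helper names are ours, not the paper's code; the kernel width is taken as σh with σ the sample standard deviation, and the constant h = 0.125 quoted in the example below):

    import numpy as np

    def pde(x, sample, h):
        """Gaussian KDE of Eq. 6: mean of kernels centred on each event."""
        width = np.std(sample) * h                 # kernel width sigma*h
        z = (x[:, None] - sample[None, :]) / width
        return np.mean(np.exp(-0.5 * z ** 2), axis=1) / (width * np.sqrt(2.0 * np.pi))

    def nllr(sample, p_theory, h):
        """-ln L_R of Eq. 10, with P_data estimated by the PDE."""
        return np.sum(np.log(pde(sample, sample, h)) - np.log(p_theory(sample)))

    rng = np.random.default_rng(4)
    c = rng.exponential(1.0, 1000)                 # data drawn from the theory
    print(nllr(c, lambda x: np.exp(-x / 1.0) / 1.0, h=0.125))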
5. An illustrative example

We consider a simple one-dimensional case where the data are an exponential distribution, say the decay times of a radioactive isotope. The theoretical prediction is given by

P(c|s) = \frac{1}{s} \exp(-c/s)    (11)

We have chosen an exponential with s = 1.0 for this example. The Gaussian kernel for the PDE would be given by

G(c) = \frac{1}{\sqrt{2\pi}\,\sigma h} \exp\left(-\frac{c^2}{2\sigma^2 h^2}\right)    (12)

where the variance σ of the exponential is numerically equal to s. To begin with, we chose a constant value for the smoothing parameter, which for the 1000 events generated is calculated to be 0.125. Figure 1 shows the generated events, the theoretical curve P(c|s) and the PDE curve P(c) normalized to the number of events.

[Figure 1: Histogram (with errors) of the generated events. Superimposed are the theoretical curve P(c|s) and the PDE estimator (solid histogram with no errors).]

The PDE fails to reproduce the data near the origin due to the boundary effect, whereby the Gaussian probabilities for events close to the origin spill over to negative values of c. This lost probability would be compensated by events on the exponential distribution with negative c if they existed. In our case, this presents a drawback for the PDE method, which we will remedy later in the paper using PDE definitions on the hypercube and periodic boundary conditions. For the time being, we will confine our example to values of c > 1.0 to avoid the boundary effect.

In order to test the goodness of fit capabilities of the likelihood ratio L_R, we superimpose a Gaussian on the exponential and try to fit the data with a simple exponential. Figure 2 shows the "data", with 1000 events generated as an exponential in the fiducial range 1.0 < c < 5.0. Superimposed on it is a Gaussian of 500 events. More events in the exponential are generated in the interval 0.0 < c < 1.0 to avoid the boundary effect at the fiducial boundary at c = 1.0.

[Figure 2: Histogram (with errors) of 1000 events in the fiducial interval 1.0 < c < 5.0, generated as an exponential with decay constant s = 1.0, with a superimposed Gaussian of 500 events centered at c = 2.0 and width 0.2. The PDE estimator is the solid histogram with no errors.]

Since the number density varies significantly, we have had to introduce a method of iteratively determining the smoothing factor as a function of c, as described in [3]. With this modification of the PDE, one gets a good description of the behavior of the data by the PDE, as shown in Figure 2. We now vary the number of events in the Gaussian and obtain the value of the negative log likelihood ratio NLLR as a function of the strength of the Gaussian. Table I summarizes the results.
Table I: Goodness of fit results from unbinned likelihood and binned likelihood fits for various data samples. The negative values for the number of standard deviations in some of the examples are due to statistical fluctuation.

    Number of        Unbinned fit   Unbinned fit   Binned fit
    Gaussian events  NLLR           Nσ             χ² (39 d.o.f.)
    500              189.           103            304
    250              58.6           31             125
    100              11.6           4.9            48
    85               8.2            3.0            42
    75               6.3            1.9            38
    50               2.55           -0.14          30
    0                0.44           -1.33          24
The number of standard deviations the unbinned likelihood fit is from what is expected is determined empirically, by plotting the value of NLLR for a large number of fits where no Gaussian is superimposed (i.e. the null hypothesis), determining the mean and RMS of this distribution, and using these to estimate the number of σ's the observed NLLR is from the null case. Table I also gives the results of a binned fit on the same "data". It can be seen that the unbinned fit gives a 3σ discrimination when the number of Gaussian events is 85, whereas the binned fit gives a χ²/ndf of 42/39 for the same case. We intend to make these tests more sophisticated in future work.

Figure 3 shows the variation of −ln P(c^n|s) and −ln P_{PDE}(c^n) for an ensemble of 500 experiments, each with n = 1000 events in the exponential and no events in the Gaussian (null hypothesis). It can be seen that −ln P(c^n|s) and −ln P_{PDE}(c^n) are correlated with each other, and that the difference between the two (the NLLR) is a much narrower distribution than either and provides the goodness of fit discrimination.

[Figure 3: (a) The distribution of the negative log-likelihood −ln P(c^n|s) for an ensemble of experiments where the data and the theory are expected to fit. (b) The negative log PDE likelihood −ln P_{PDE}(c^n) for the same data. (c) The correlation between the two. (d) The negative log-likelihood ratio NLLR, obtained by subtracting (b) from (a) on an event by event basis.]
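The empirical calibration described above can be sketched as follows (settings assumed; nllr() and the theory pdf are the helpers from the sketch in Section 4):

    import numpy as np

    rng = np.random.default_rng(5)

    def n_sigma(nllr_observed, p_theory, n_events=1000, n_toys=500, h=0.125):
        """Significance of an observed NLLR relative to the null ensemble."""
        toys = np.array([nllr(rng.exponential(1.0, n_events), p_theory, h)
                         for _ in range(n_toys)])   # nllr() from the earlier sketch
        return (nllr_observed - toys.mean()) / toys.std()

    # e.g. the fit with 85 Gaussian events in Table I has NLLR = 8.2
    print(n_sigma(8.2, lambda x: np.exp(-x / 1.0) / 1.0))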
5.1. Improving the PDE

The PDE technique we have used so far suffers from two drawbacks: firstly, the smoothing parameter has to be iteratively adjusted significantly over the full range of the variable c, since the distribution P(c|s) changes significantly over that range; and secondly, there are boundary effects at c = 0, as shown in Figure 1. Both these flaws are remedied if we define the PDE in hypercube space. After we find the maximum likelihood point s*, for which the PDE is not needed, we transform the variable c → c' such that the distribution P(c'|s*) is flat and 0 < c' < 1. The hypercube transformation can be made even if c is multi-dimensional, by initially going to a set of variables that are uncorrelated and then making the hypercube transformation. The transformation can be such that any interval in c space maps onto the interval (0,1) in hypercube space. We solve the boundary problem by imposing periodicity in the hypercube. In the one dimensional case, we imagine three "hypercubes", each identical to the other, on the real axis in the intervals (−1,0), (0,1) and (1,2). The hypercube of interest is the one in the interval (0,1). When the probability from an event kernel leaks outside the boundary (0,1), we continue the kernel into the next hypercube. Since the hypercubes are identical, this implies the kernel re-appearing in the middle hypercube, but from the opposite boundary. Put mathematically, the kernel is defined such that

G(c' - c'_i) = G(c' - c'_i - 1);  c' > 1    (13)

G(c' - c'_i) = G(c' - c'_i + 1);  c' < 0    (14)

Although a Gaussian kernel will work on the hypercube, the natural kernel to use, considering the shape of the hypercube, would be the function G(c')

G(c') = \frac{1}{h};  |c'| < \frac{h}{2}    (15)

G(c') = 0;  |c'| > \frac{h}{2}    (16)

This kernel would be subject to the periodic boundary conditions given above, which further ensure that every event in hypercube space is treated exactly as every other event, irrespective of its co-ordinates. The parameter h is a smoothing parameter which needs to be chosen with some care. However, since the theory distribution is flat in hypercube space, the smoothing parameter may not need to be iteratively determined over hypercube space, to the extent that the data distribution is similar to the theory distribution. Even if iteration is used, the variation of h in hypercube space is likely to be much smaller.
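A sketch of this construction for the one-dimensional example (function names are ours): events are mapped to the hypercube with the fitted cumulative distribution, and the box kernel of Eqs. 15-16 is wrapped around the boundaries per Eqs. 13-14 by measuring distances on a circle of unit circumference.

    import numpy as np

    def to_hypercube(c, s_star=1.0):
        """c' = F(c|s*): the CDF of the fitted exponential maps c to (0,1)."""
        return 1.0 - np.exp(-c / s_star)

    def pde_box_periodic(x, sample, h=0.2):
        """Box kernel of width h (Eqs. 15-16) with periodic wrap-around
        (Eqs. 13-14): distances taken on a circle of circumference 1."""
        d = np.abs(x[:, None] - sample[None, :])
        d = np.minimum(d, 1.0 - d)
        return np.mean(d < 0.5 * h, axis=1) / h

    rng = np.random.default_rng(6)
    cp = to_hypercube(rng.exponential(1.0, 1000))
    print(pde_box_periodic(cp, cp).mean())    # ~1: flat density, no edge loss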
Figure 4 shows the distribution of the NLLR for the null hypothesis for an ensemble of 500 experiments, each with 1000 events, as a function of the smoothing factor h. It can be seen that the distribution narrows considerably as the smoothing factor increases. We choose an operating value of 0.2 for h and study the dependence of the NLLR on the number of events, ranging from 100 to 1000, as shown in Figure 5. The dependence on the number of events is seen to be weak, indicating good behavior. The PDE thus arrived at, computed with h = 0.2, can be transformed from the hypercube space to c space and will reproduce the data smoothly and with no edge effects. We note that it is also easier to arrive at an analytic theory of NLLR with the choice of this simple kernel.

[Figure 4: The distribution of the negative log likelihood ratio NLLR for the null hypothesis, for an ensemble of 500 experiments each with 1000 events, as a function of the smoothing factor h = 0.1, 0.2 and 0.3.]

[Figure 5: The distribution of the negative log likelihood ratio NLLR for the null hypothesis, for an ensemble of 500 experiments each with the smoothing factor h = 0.2, as a function of the number of events.]

6. End of Bayesianism?

By Bayesianism, we mean the practice of "guessing" a prior distribution and introducing it into the calculations. In what follows we will show that what is conventionally thought of as a Bayesian prior distribution is in reality a number that can be calculated from the data. We are able to do this since we use two pdf's, one for theory and one for data.
In what follows, we will interpret the probability distribution of the parameters in a strictly frequentist sense. The pdf of s is the distribution of the best estimator of the true value s_T of s from an ensemble of an infinite number of identical experiments with the same statistical power n.

6.1. Calculation of fitted errors

After the fitting is done and the goodness of fit is evaluated, one needs to work out the errors on the fitted quantities. One needs to calculate the posterior density P(s|c^n), which requires information not only from the maximum likelihood point s* of the present experiment, but also on how such a measurement is likely to fluctuate if we repeat the experiment. The joint probability density P(s,c^n) of observing the parameter s and the data c^n is given by

P^{data}(s,c^n) = P(s|c^n) P^{data}(c^n)    (17)

where we use the superscript "data" to distinguish the joint probability P^{data}(s,c^n) as having come from using the data pdf. If we now integrate the above equation over all possible datasets c^n, we get the expression for the pdf of s,

\mathcal{P}_n(s) = \int P^{data}(s,c^n)\, dc^n = \int P(s|c^n) P^{data}(c^n)\, dc^n    (18)

where we have used the symbol \mathcal{P} to distinguish the fact that it is the true pdf of s obtained from an infinite ensemble. We use the subscript n in \mathcal{P}_n(s) to denote that the pdf is obtained from an ensemble of experiments with n events each. Later on we will show that \mathcal{P}_n(s) is indeed dependent on n. Equation 18 states that in order to obtain the pdf of the parameters, one needs to add together the conditional probabilities P(s|c^n) over an ensemble of experiments, each such distribution weighted by the "data likelihood" P^{data}(c^n). At this stage of the discussion, the functions P(s|c^n) are unknown. We have, however, worked out L_R(s) as a function of s and have evaluated the maximum likelihood value s* of s. We can choose an arbitrary value of s and evaluate the goodness of fit at that value using the likelihood ratio. When we choose an arbitrary value of s, we are in fact hypothesizing that the true value s_T is at this value of s. L_R(s) then gives us a way of evaluating the relative goodness of fit of the hypothesis as we change s. Let us now take an arbitrary value of s and hypothesize that it is the true value. Then the joint probability of observing c^n, with s_T being at this value of s, is given from the data end by equation 17.

Similarly, from the theoretical end, one can calculate the joint probability of observing the dataset c^n with the true value being at s. The true value s_T is taken to be the maximum likelihood point of the pdf \mathcal{P}_n(s). It may coincide with the mean value of the pdf \mathcal{P}_n(s). These statements are assertions of the unbiased nature of the data from the experiment. At this point, there is no information available on where the true value s_T lies. One can make the hypothesis that a particular value of s is the true value; the probability of obtaining a best estimator s* from experiments of the type being performed, in the interval s_T to s_T + ds_T, is \mathcal{P}_n(s_T) ds_T. The actual value of this number is a function of the experimental resolution and the statistics n of the experiment. The joint probability P^{theor}(s,c^n) from the theoretical end is given by the product of the probability density of the pdf of s at the true value of s, namely \mathcal{P}_n(s_T), and the theoretical likelihood P(c^n|s) evaluated at the true value, which by our hypothesis is s,

P^{theor}(s,c^n) = P^{theor}(c^n|s) \mathcal{P}_n(s_T)    (19)

The joint probability P(s,c^n) is a joint distribution of the theoretical parameter s and the data c^n. The two ways of evaluating it (from the theoretical end and the data end) must yield the same result, for consistency. This is equivalent to equating P^{data}(s,c^n) and P^{theor}(s,c^n), which gives the equation

P(s|c^n) P^{data}(c^n) = P^{theor}(c^n|s) \mathcal{P}_n(s_T)    (20)

which is a form of Bayes' theorem, but with two pdf's (theory and data). Let us note that the above equation can be immediately re-written as the likelihood ratio

L_R = \frac{P(s|c^n)}{\mathcal{P}_n(s_T)} = \frac{P^{theor}(c^n|s)}{P^{data}(c^n)}    (21)

which is what is used to obtain the goodness of fit. In order to get the fitted errors, we need to evaluate P(s|c^n), which necessitates a better understanding of what \mathcal{P}_n(s_T) is in equation 20. Rearranging equation 20, one gets

P(s|c^n) = L_R(s) \mathcal{P}_n(s_T) = \frac{P^{theor}(c^n|s)}{P^{data}(c^n)} \mathcal{P}_n(s_T)    (22)
6.1.1. To show that \mathcal{P}_n(s_T) depends on n

In practice, in both the binned and unbinned cases, one only has an approximation to P^{data}(c^n). As n → ∞, in the absence of experimental bias, one expects to determine the parameter set s to infinite accuracy, and P(s|c^n) → δ(s − s_T), where s_T is the true value of s. However, for the null hypothesis, as n → ∞, the statistical error introduced by our use of the PDE in the unbinned case, or by binning in the binned case, becomes negligible, with the result that the theory pdf describes the data for all c at the true value s_T, i.e.

\frac{P^{theor}(c|s_T)}{P^{data}(c)} \to 1  as  n \to \infty    (23)
When one evaluates the likelihood ratio L_R over n events, with n → ∞, the likelihood ratio does not necessarily remain unity. This is due to fluctuations in the data, which grow as \sqrt{n}. For the binned likelihood case with n_b bins, one can show that as n → ∞,

L_R \to e^{-\sum_{i=1}^{n_b} \chi_i^2/2} \to e^{-n_b/2}    (24)

since each χ_i² has unit expectation. This is just an example of the likelihood ratio theorem. If one uses a binned χ² fit, which can also be thought of as maximizing a likelihood ratio, one gets the same limit as when using binned likelihood fits. The point is that L_R is finite as n → ∞. In the unbinned case, we currently have no analytic theory available. However, one can argue that the binned case with the number of bins n_b → ∞ and n_b << n should approach the unbinned limit. In this case, the unbinned L_R is also finite for infinite statistics. This implies that \mathcal{P}_n(s_T) → ∞ as n → ∞, i.e. \mathcal{P}_n(s_T) depends on n. This puts an end to the notion of a monolithic Bayesian prior interpretation for \mathcal{P}_n(s).
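A quick Monte Carlo illustration of this point, under the Gaussian-bin assumptions of Section 3 (toy settings of our own): for data generated from the model itself, −2 ln L_R fluctuates about n_b rather than about zero.

    import numpy as np

    rng = np.random.default_rng(7)
    nb, n_toys = 40, 2000
    mu = np.full(nb, 1000.0)             # large bin contents: Gaussian regime

    # -2 ln L_R = sum of per-bin chi-square terms (Section 3)
    m2lnLR = [np.sum((rng.poisson(mu) - mu) ** 2 / mu) for _ in range(n_toys)]
    print(np.mean(m2lnLR))               # ~ nb = 40, i.e. L_R ~ exp(-nb/2)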
Pn theBayesian prior changes as thevariable s changes. In
thelower figure,an “unknown concomitant” distribution
6.1.2.ToshowthatPn(sT)isconstantwithrespecttos is used whose shape dependson the statistics. In the
When one varies the likelihoodratioin equation22 case of nobias, this distribution peaks at thetrue value
of s. As we change s,we change our hypothesis as to
as a function of s, for each value of s, one is mak-
where thetruevalue of s lies, and thedistribution shifts
ing a hypothesis that s = s . As one changes s, a
T with s as explained in the text. The value of the
newhypothesisisbeingtestedthatismutually exclu-
distribution at thetrue valueis thusindependent of s.
sive from the previous one, since the true value can
only be at one location. So as one changes s, one is
free to move the distribution (s) so that s is at
Pn T data set cn. This is our measurement of αn and dif-
the value of s being tested. This implies that (s )
Pn T ferentdatasetswillgivedifferentvaluesofαn,inother
does not change as one changes s and is a constant
wordsα willhaveasamplingdistributionwithanex-
n
wrt s, which we can now write as α . Figure 6 illus-
n pected value and standard deviation. As n , the
trates these points graphically. Thus (s ) in our →∞
Pn T likelihood ratio R will tend to a finite value at the
equations is a number, not a function. The distri- L
true value and zero for all other values, and α
n
bution (s) should not be thought of as a “prior” →∞
Pn as a result.
but asan“unknownconcomitant”,whichdepends on
Note that it is only possible to write down an ex-
the statistics and the measurementcapabilities of the
pression for α dimensionally when a likelihood ratio
n
apparatus. For a given apparatus, there are a denu-
is available. This leads to
R
merable infinity of such distributions, one for each n. L
TanhdesPend(sisTtr)ib→ut∞ionassbnec→om∞e.narrower as n increases P(s|cn)= LRRds = PP((ccnn|ss))ds (27)
L |
R R
6.2. New form of equations

Equation 22 can now be re-written

P(s|c^n) = \frac{P(c^n|s)\, \alpha_n}{P^{data}(c^n)}    (25)

Since P(s|c^n) must normalize to unity, one gets for α_n

\alpha_n = \frac{P^{data}(c^n)}{\int P(c^n|s)\, ds} = \frac{1}{\int L_R(s)\, ds}    (26)

We have thus determined α_n, the value of the "unknown concomitant" at the true value s_T, using our data set c^n. This is our measurement of α_n, and different datasets will give different values of α_n; in other words, α_n will have a sampling distribution with an expected value and a standard deviation. As n → ∞, the likelihood ratio L_R will tend to a finite value at the true value and zero for all other values, and α_n → ∞ as a result.

Note that it is only possible to write down a dimensionally consistent expression for α_n when a likelihood ratio L_R is available. This leads to

P(s|c^n) = \frac{L_R}{\int L_R\, ds} = \frac{P(c^n|s)}{\int P(c^n|s)\, ds}    (27)

The last equality in equation 27 is the same expression that "frequentists" use for calculating their errors after fitting, namely, the likelihood curve normalized to unity gives the parameter errors. If the likelihood curve is Gaussian shaped, then this justifies the usage of a change of negative log-likelihood of 1/2 from the optimum point to get the 1σ errors. Even if it is not Gaussian, as we show in section 8, we may use the expression for P(s|c^n) as a pdf of the parameter s to evaluate the errors.
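Equation 27 translates directly into the following sketch (grid, model and toy data assumed): normalize the likelihood curve in s, then extract the fitted value and its error from the resulting density.

    import numpy as np

    rng = np.random.default_rng(8)
    c = rng.exponential(1.0, 1000)                # toy dataset
    s_grid = np.linspace(0.8, 1.3, 501)

    # ln P(c^n|s) for the exponential model; P_data(c^n) is independent of s
    # and drops out once the ratio is normalized, so it is omitted here
    lnL = np.array([np.sum(-np.log(s) - c / s) for s in s_grid])
    post = np.exp(lnL - lnL.max())
    post /= np.trapz(post, s_grid)                # P(s|c^n) of Eq. 27

    s_hat = np.trapz(s_grid * post, s_grid)
    s_err = np.sqrt(np.trapz((s_grid - s_hat) ** 2 * post, s_grid))
    print(f"s = {s_hat:.3f} +- {s_err:.3f}")      # ~ 1.00 +- 0.03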
The normalization condition

P(c^n) = \int P^{theor}(s,c^n)\, ds = \int P(c^n|s) \mathcal{P}_n(s_T)\, ds    (28)
is obeyed by our solution, since

\int P(c^n|s) \mathcal{P}_n(s_T)\, ds = \int \alpha_n P(c^n|s)\, ds \equiv P^{data}(c^n)    (29)

Each term α_n P(c^n|s) ds in the above integral may be thought of as being due to an "unknown concomitant" whose peak value is distributed uniformly in s space. The likelihoods of the theoretical prediction P(c^n|s) contribute with equal probability, each with a weight α_n, to sum up to form the data likelihood P^{data}(c^n); i.e. the data, due to their statistical inaccuracy, will entertain a range of theoretical parameters. However, equation 29 does not give us any further information, since it is obeyed identically. Fitting for the maximum likelihood value s* of s is attained by maximizing the likelihood ratio L_R = P(c^n|s)/P^{data}(c^n). The goodness of fit is obtained using the value of L_R at the maximum likelihood point. The best theoretical prediction is P(c|s*), and this prediction is used to compare to the data pdf P^{data}(c). Note that the maximum likelihood value s* is also the point at which the posterior density P(s|c^n) peaks. This is true only in our method: when an arbitrary Bayesian prior is used, the maximum likelihood value is not the same point at which the posterior density peaks. Note also that the normalization equation \int \mathcal{P}_n(s)\, ds = 1 is still valid. The integral

\int \alpha_n\, ds \neq 1    (30)

since α_n is our measurement of the value of \mathcal{P}_n(s) at the true value. It is a measure of the statistical accuracy of the experiment: the larger the value of α_n, the narrower the distribution \mathcal{P}_n(s) and the more accurate the experiment.
7. Combining Results of Experiments

Each experiment should publish a likelihood curve for its fit as well as a number for the data likelihood P^{data}(c^n). Combining the results of two experiments, with m and n events respectively, involves multiplying the likelihood ratios,

L_R^{m+n}(s) = L_R^m(s) \times L_R^n(s) = \frac{P(c^m|s)}{P^{data}(c^m)} \times \frac{P(c^n|s)}{P^{data}(c^n)}    (31)

Posterior densities and goodness of fit can be deduced from the combined likelihood ratio.
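In practice, Equation 31 amounts to adding log-likelihood-ratio curves on a common grid of s, as in this sketch (toy experiments assumed; the s-independent data likelihoods only shift the curves and do not move the maximum):

    import numpy as np

    rng = np.random.default_rng(9)
    s_grid = np.linspace(0.8, 1.3, 501)

    def ln_lr_curve(c, s_grid):
        """ln L_R(s) up to the s-independent constant -ln P_data(c^n)."""
        return np.array([np.sum(-np.log(s) - c / s) for s in s_grid])

    curve_m = ln_lr_curve(rng.exponential(1.0, 400), s_grid)   # experiment 1
    curve_n = ln_lr_curve(rng.exponential(1.0, 600), s_grid)   # experiment 2

    combined = curve_m + curve_n
    print(s_grid[np.argmax(combined)])     # combined maximum likelihood point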
8. Interpreting the results of one experiment

After performing a single experiment with n events, we can now calculate P(s|c^n) using equation 27. Equation 18 gives the prescription for arriving at \mathcal{P}_n(s) given an ensemble of such experiments, the contribution from each experiment being weighted by the "data likelihood" P^{data}(c^n) for that experiment. The "data likelihoods" integrate to unity, i.e. \int P^{data}(c^n)\, dc^n = 1. In the case of only a single experiment, with the observed c^n being denoted by c^n_{obs},

P^{data}(c^n) = \delta(c^n - c^n_{obs})    (32)

Equation 18, for a single experiment, then reduces to

\mathcal{P}_n(s) = \int P(s|c^n) P^{data}(c^n)\, dc^n = P(s|c^n_{obs})    (33)

i.e. given a single experiment, the best estimator for \mathcal{P}_n(s), the pdf of the true value s_T, is P(s|c^n_{obs}), and thus the best estimator of s_T is s*_{obs} as deduced from the experiment. We can thus use P(s|c^n_{obs}) as though it is the pdf of s and deduce limits and errors from it. The proviso is of course that these limits and errors, as well as s*_{obs}, come from a single experiment of finite statistics and as such are subject to statistical fluctuations.

9. Comparison with the Bayesian approach

In the Bayesian approach, an unknown Bayesian prior P(s) is assumed for the distribution of the parameter s in the absence of any data. The shape of the prior is guessed at, based on subjective criteria or using other objective pieces of information. However, such a shape is not invariant under transformation of variables. For example, if we assume that the prior P(s) is flat in s, then if we analyze the problem in s², we cannot assume it is flat in s². This feature of the Bayesian approach has caused controversy. Also, the notion of a pdf of the data does not exist in the Bayesian approach, and P(c^n) is taken to be a normalization constant; as such, no goodness of fit criteria exist. In the method outlined here, we have used Bayes' theorem to calculate posterior densities of the fitted parameters while being able to compute the goodness of fit. The formalism developed here shows that what is conventionally thought of as a Bayesian prior distribution is in fact a normalization constant, and what Bayesians think of as a normalization constant is in fact the pdf of the data. Table II outlines the major differences between the Bayesian approach and the new one.
Table II: The key points of difference between the Bayesian method and the new method.

    Item: Goodness of fit
      Bayesian method: Absent.
      New method: Now available in both binned and unbinned fits.

    Item: Data
      Bayesian method: Used in evaluating the theory pdf at the data points.
      New method: Used in evaluating the theory pdf at the data points, as
      well as in evaluating the data pdf at the data points.

    Item: Prior
      Bayesian method: Is a distribution, guessed based on "degrees of
      belief"; independent of data, monolithic.
      New method: No prior needed. One calculates a constant from data,
      \alpha_n = P^{data}(c^n) / \int P(c^n|s) ds, which → ∞ as n → ∞.

    Item: Posterior density P(s|c^n)
      Bayesian method: Depends on prior;
      P(c^n|s) P(s) / \int P(c^n|s) P(s) ds.
      New method: Independent of prior; same as frequentists use,
      P(c^n|s) / \int P(c^n|s) ds.
10. Further work to be done

Equation 18 can be used to show that the expectation value E(s) of the parameter s is given by

E(s) = \int s\, \mathcal{P}_n(s)\, ds = \int dc^n\, P(c^n) \int s\, P(s|c^n)\, ds    (34)

     = \int \bar{s}(c^n)\, P(c^n)\, dc^n    (35)

where \bar{s}(c^n) is the average of s for an individual experiment. Equation 35 states that E(s) is the weighted average of the \bar{s}(c^n) obtained from individual measurements, the weight for each experiment being the "data likelihood" P(c^n) for that experiment. In the absence of experimental bias, E(s) would be identical to the true value s_T. It remains to be shown that the weighted average of the maximum likelihood values s* from individual experiments also converges to the maximum likelihood point of \mathcal{P}_n(s).

One also needs to develop an analytic theory of the goodness of fit for unbinned likelihood fits. Finally, one needs to investigate more closely the transformation properties of \mathcal{P}_n(s) under change of variable.
11. Conclusions

To conclude, we have proposed a scheme for obtaining the goodness of fit in unbinned likelihood fits. This scheme involves the usage of two pdf's, namely data and theory. In the process of computing the fitted errors, we have demonstrated that the quantity in the joint probability equations that has been interpreted as the "Bayesian prior" is in reality a number, not a distribution. This number is the value of the pdf of the parameter, which we call the "unknown concomitant", at the true value of the parameter. This number is calculated from a combination of data and theory and is seen to be an irrelevant parameter. If this viewpoint is accepted, the controversial practice of guessing distributions for the "Bayesian prior" can now be abandoned, as can the terms "Bayesian" and "frequentist". We have shown how to use the posterior density to rigorously calculate fitted errors.

Acknowledgments

This work is supported by the Department of Energy. The author wishes to thank Jim Linnemann and Igor Volobouev for useful comments.

References

[1] K. Kinoshita, "Evaluating Quality of Unbinned Maximum Likelihood Fitting", Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, March 2002, IPPP/02/39, DCPT/02/78;
B. Yabsley, "Statistical Practice at the BELLE Experiment, and some questions", ibid.;
R. D. Cousins, "Conference Summary", ibid.
[2] R. A. Fisher, "On the mathematical foundations of theoretical statistics", Philos. Trans. R. Soc. London Ser. A 222, 309-368 (1922);
R. A. Fisher, "Theory of statistical estimation", Proc. Cambridge Philos. Soc. 22, 700-725 (1925).
[3] R. Raja, "A measure of the goodness of fit in unbinned likelihood fits", long write-up, http://www-conf.slac.stanford.edu/phystat2003/talks/raja/Raja_bayes_maxlike.pdf
[4] R. Raja, "End of Bayesianism?", http://www-conf.slac.stanford.edu/phystat2003/talks/raja/raja-end
[5] E. Parzen, "On estimation of a probability density function and mode", Ann. Math. Statist. 32, 1065-1072 (1962).
[6] D. Scott, "Multivariate Density Estimation", John Wiley & Sons, 1992;
M. Wand and M. Jones, "Kernel Smoothing", Chapman & Hall, 1995.