Modelling Competitive Sports:
Bradley-Terry-Élő Models
for Supervised and On-Line Learning
of Paired Competition Outcomes

Franz J. Király∗ 1 and Zhaozhi Qian† 1,2

1 Department of Statistical Science, University College London,
Gower Street, London WC1E 6BT, United Kingdom
2 King Digital Entertainment plc, Ampersand Building,
178 Wardour Street, London W1F 8FY, United Kingdom

January 30, 2017

Abstract

Prediction and modelling of competitive sports outcomes has received much recent attention, especially from the Bayesian statistics and machine learning communities. In the real world setting of outcome prediction, the seminal Élő update still remains, after more than 50 years, a valuable baseline which is difficult to improve upon, though in its original form it is a heuristic and not a proper statistical "model". Mathematically, the Élő rating system is very closely related to the Bradley-Terry models, which are usually used in an explanatory fashion rather than in a predictive supervised or on-line learning setting.

Exploiting this close link between these two model classes and some newly observed similarities, we propose a new supervised learning framework with close similarities to logistic regression, low-rank matrix completion and neural networks. Building on it, we formulate a class of structured log-odds models, unifying the desirable properties found in the above: supervised probabilistic prediction of scores and wins/draws/losses, batch/epoch and on-line learning, as well as the possibility to incorporate features in the prediction, without having to sacrifice simplicity, parsimony of the Bradley-Terry models, or computational efficiency of Élő's original approach.

We validate the structured log-odds modelling approach in synthetic experiments and English Premier League outcomes, where the added expressivity yields the best predictions reported in the state-of-art, close to the quality of contemporary betting odds.

∗[email protected]
†[email protected]
Contents

1 Introduction
  1.1 Modelling and predicting competitive sports
  1.2 History of competitive sports modelling
  1.3 Aim of competitive sports modelling
  1.4 Main questions and challenges in competitive sports outcomes prediction
  1.5 Main contributions
  1.6 Manuscript structure

2 The Mathematical-Statistical Setting
  2.1 Supervised prediction of competitive outcomes
    2.1.1 The Generative Model
    2.1.2 The Observation Model
    2.1.3 The Learning Task
  2.2 Losses for probabilistic classification
  2.3 Learning with structured and sequential data
    2.3.1 Conditioning on the pairing
    2.3.2 Conditioning on time

3 Approaches to competitive sports prediction
  3.1 The Bradley-Terry-Élő models
    3.1.1 The original formulation of the Élő model
    3.1.2 Bradley-Terry-Élő models
    3.1.3 Glickman's Bradley-Terry-Élő model
    3.1.4 Limitations of the Bradley-Terry-Élő model and existing remedies
  3.2 Domain-specific parametric models
    3.2.1 Bivariate Poisson regression and extensions
    3.2.2 Bayesian latent variable models
  3.3 Feature-based machine learning predictors
  3.4 Evaluation methods used in previous studies

4 Extending the Bradley-Terry-Élő model
  4.1 The structured log-odds model
    4.1.1 Statistical definition of structured log-odds models
    4.1.2 Important special cases
    4.1.3 Connection to existing model classes
  4.2 Predicting non-binary labels with structured log-odds models
    4.2.1 The structured log-odds model with features
    4.2.2 Predicting ternary outcomes
    4.2.3 Predicting score outcomes
  4.3 Training of structured log-odds models
    4.3.1 The likelihood of structured log-odds models
    4.3.2 Batch training of structured log-odds models
    4.3.3 On-line training of structured log-odds models
  4.4 Rank regularized log-odds matrix estimation

5 Experiments
  5.1 Synthetic experiments
    5.1.1 Two-factor Bradley-Terry-Élő model
    5.1.2 Rank-four Bradley-Terry-Élő model
    5.1.3 Regularized log-odds matrix estimation
  5.2 Predictions on the English Premier League
    5.2.1 Description of the dataset
    5.2.2 Validation setting
    5.2.3 Prediction Strategy
    5.2.4 Quantitative comparison for the evaluation metrics
    5.2.5 Performance of the structured log-odds model
    5.2.6 Performance of the batch learning models
  5.3 Fairness of the English Premier League ranking

6 Discussion and Summary
  6.1 Methodological findings
  6.2 Findings on the English Premier League
  6.3 Open questions
1. Introduction
1.1. Modelling and predicting competitive sports

Competitive sports refers to any sport that involves two teams or individuals competing against each other to achieve higher scores. Competitive team sports include some of the most popular and most watched games such as football, basketball and rugby. Such sports are played both in domestic professional leagues such as the National Basketball Association, and in international competitions such as the FIFA World Cup. For football alone, there are over one hundred fully professional leagues in 71 countries globally. It is estimated that the Premier League, the top football league in the United Kingdom, attracted a (cumulative) television audience of 4.7 billion viewers in the last season [47].
The outcome of a match is determined by a large number of factors. Just to name a few, they might involve the competitive strength of each individual player in both teams, the smoothness of collaboration between players, and the team's strategy of playing. Moreover, the composition of any team changes over the years, for example because players leave or join the team. The team composition may also change within the tournament season or even during a match because of injuries or penalties.
Understanding these factors is, by the prediction-validation nature of the scientific method, closely linked to predicting the outcome of a pairing. By Occam's razor, the factors which empirically help in prediction are exactly those that one may hypothesize to be relevant for the outcome.

Since keeping track of all relevant factors is unrealistic, of course one cannot expect a certain prediction of a competitive sports outcome. Moreover, it is also unreasonable to believe that all factors can be measured or controlled, hence it is reasonable to assume that unpredictable, or non-deterministic statistical "noise" is involved in the process of generating the outcome (or to subsume the unknowns as such noise). A good prediction will, hence, not exactly predict the outcome, but will anticipate the "correct" odds more precisely. The extent to which the outcomes are predictable may hence be considered as a surrogate quantifier of how much the outcome of a match is influenced by "skill" (as surrogated by determinism/prediction), or by "chance"¹ (as surrogated by the noise/unknown factors).

Phenomena which cannot be specified deterministically are in fact very common in nature. Statistics and probability theory provide ways to make inference under randomness. Therefore, modelling and predicting the results of competitive team sports naturally falls into the area of statistics and machine learning. Moreover, any interpretable predictive model yields a possible explanation of what constitutes factors influencing the outcome.
1.2. History of competitive sports modelling

Research on modelling competitive sports has a long history. In its early days, research was often closely related to sports betting or player/team ranking [22, 26]. The two most influential approaches are due to Bradley and Terry [3] and Élő [15]. The Bradley-Terry and Élő models allow estimation of player ratings; the Élő system additionally contains algorithmic heuristics to easily update a player's rank, which have been in use for official chess rankings since the 1960s. The Élő system is also designed to predict the odds of a player winning or losing to the opponent. In contemporary practice, Bradley-Terry and Élő type models are broadly used in modelling of sports outcomes and ranking of players, and it has been noted that they are very close mathematically.
More recently, relatively diverse modelling approaches originating from the Bayesian statistical framework [37, 13, 20], and also some inspired by machine learning principles [36, 23, 43], have been applied for modelling competitive sports. These models are more expressive and remove some of the Bradley-Terry and Élő models' limitations, though usually at the price of interpretability, computational efficiency, or both.

¹ We expressly avoid use of the word "luck" as in vernacular use it often means "chance", jointly with the belief that it may be influenced by esoteric, magical or otherwise metaphysical means. While in the suggested surrogate use, it may well be that the "chance" component of a model subsumes possible points of influence which simply are not measured or observed in the data, an extremely strong corpus of scientific evidence implies that these will not be metaphysical, only unknown - two qualifiers which are obviously not the same, despite strong human tendencies to believe the contrary.
A more extensive literature overview of existing approaches will be given later in Section 3, as the literature spans multiple communities and, in our opinion, a prior exposition of the technical setting and a simultaneous straightening of thoughts benefits the understanding and allows us to give proper credit and context for the widely different ideas employed in competitive sports modelling.
1.3. Aim of competitive sports modelling

In the literature, the study of competitive team sports may be seen to lie between two primary goals. The first goal is to design models that make good predictions for future match outcomes. The second goal is to understand the key factors that influence the match outcome, mostly through retrospective analysis [45, 50]. As explained above, these two aspects are intrinsically connected, and in our view they are the two facets of a single problem: on one hand, proposed influential factors are only scientifically valid if confirmed by falsifiable experiments such as predictions on future matches. If the predictive performance does not increase when information about such factors enters the model, one should conclude by Occam's razor that these factors are actually irrelevant². On the other hand, it is plausible to assume that predictions are improved by making use of relevant factors (also known as "features") as they become available, for example because they are capable of explaining unmodelled random effects (noise). In light of this, the main problem considered in this work is the (validatable and falsifiable) prediction problem, which in machine learning terminology is also known as the supervised learning task.
1.4. Main questions and challenges in competitive sports outcomes prediction

Given the above discussion, the major challenges may be stated as follows:

On the methodological side, what are suitable models for competitive sports outcomes? Current models are not at the same time interpretable, easily computable, able to use feature information on the teams/players, and able to predict scores or ternary outcomes. It is an open question how to achieve this in the best way, and this manuscript attempts to highlight a possible path.
The main technical difficulty lies in the fact that off-the-shelf methods do not apply due to the structured nature of the data: unlike in individual sports such as running and swimming, where the outcome depends only on the given team, and where the prediction task may be dealt with by classical statistics and machine learning technology (see [2] for a discussion of this in the context of running), in competitive team sports the outcome may be determined by potentially complex interactions between two opposing teams. In particular, the performance of any team is not measured directly using a simple metric, but only in relation to the opposing team's performance.
On the side of domain applications, which in this manuscript is Premier League football, it is of great interest to determine the relevant factors determining the outcome, the best way to predict, and which ranking systems are fair and appropriate.

All these questions are related to predictive modelling, as well as to the availability of suitable amounts of quality data. Unfortunately, the scarcity of features available in systematic presentation places a hurdle to academic research in competitive team sports, especially when it comes to assessing important factors such as team member characteristics, or strategic considerations during the match.
Moreover, closely linked is also the question to which extent the outcomes are determined by "chance" as opposed to "skill": if, on one hypothetical extreme, results proved to be completely unpredictable, there would be no empirical evidence to distinguish the matches from a game of chance such as flipping a coin. On the other hand, the importance of a measurement for predicting would strongly suggest its importance for winning (or losing), though without an experiment not necessarily a causative link.

We attempt to address these questions in the case of Premier League football within the confines of readily available data.

² ...to distinguish/characterize the observations, which in some cases may plausibly pertain to restrictions in the set of observations, rather than to causative relevance. Hypothetical example: age of football players may be identified as unimportant for the outcome - which may plausibly be due to the fact that the data contained no players of ages 5 or 80, say, as opposed to player age being unimportant in general. Rephrased, it is only unimportant for cases that are plausible to be found in the dataset in the first place.
1.5. Main contributions

Our main contributions in this manuscript are the following:

(i) We give what we believe to be the first comprehensive literature review of state-of-art competitive sports modelling that comprises the multiple communities (Bradley-Terry models, Élő type models, Bayesian models, machine learning) in which research so far has been conducted mostly separately.

(ii) We present a unified Bradley-Terry-Élő model which combines the statistical rigour of the Bradley-Terry models with fitting and update strategies similar to those found in the Élő system. Mathematically only a small step, this joint view is essential in a predictive/supervised setting as it allows efficient training and application in an on-line learning situation. Practically, this step solves some problems of the Élő system (including ranking initialization and choice of K-factor), and establishes close relations to logistic regression, low-rank matrix completion, and neural networks.

(iii) This unified view on Bradley-Terry-Élő allows us to introduce classes of joint extensions, the structured log-odds models, which unite desirable properties of the extensions found in the disjoint communities: probabilistic prediction of scores and wins/draws/losses, batch/epoch and on-line learning, as well as the possibility to incorporate features in the prediction, without having to sacrifice structural parsimony of the Bradley-Terry models, or simplicity and computational efficiency of Élő's original approach.

(iv) We validate the practical usefulness of the structured log-odds models in synthetic experiments and in answering domain questions on English Premier League data, most prominently on the importance of features, fairness of the ranking, as well as on the "chance"-"skill" divide.
1.6. Manuscript structure

Section 2 gives an overview of the mathematical setting in competitive sports prediction. Building on the technical context, Section 3 presents a more extensive review of the literature related to the prediction problem of competitive sports, and introduces a joint view on Bradley-Terry and Élő type models. Section 4 introduces the structured log-odds models, which are validated in empirical experiments in Section 5. Our results and possible future directions for research are discussed in Section 6.
Authors' contributions

This manuscript is based on ZQ's MSc thesis, submitted September 2016 at University College London, written under supervision of FK. FK provided the ideas of re-interpretation and possible extensions of the Élő model. The literature overview is jointly due to ZQ and FK, and in parts follows some very helpful pointers by I. Kosmidis (see below). Novel technical ideas in Sections 4.2 to 4.4, and experiments (set-up and implementation) are mostly due to ZQ.

The present manuscript is a substantial re-working of the thesis manuscript, jointly done by FK and ZQ.
Acknowledgements

We are thankful to Ioannis Kosmidis for comments on an earlier form of the manuscript, for pointing out some earlier occurrences of ideas presented in it but not given proper credit, as well as for relevant literature in the "Bradley-Terry" branch.
2. The Mathematical-Statistical Setting

This section formulates the prediction task in competitive sports and fixes notation, considering the task as an instance of supervised learning with several non-standard structural aspects being of relevance.
2.1. Supervised prediction of competitive outcomes

We introduce the mathematical setting for outcome prediction in competitive team sports. As outlined in the introductory Section 1.1, three crucial features need to be taken into account in this setting:

(i) The outcome of a pairing cannot be exactly predicted prior to the game, even with perfect knowledge of all determinants. Hence it is preferable to predict a probabilistic estimate for all possible match outcomes (win/draw/loss) rather than deterministically choosing one of them.

(ii) In a pairing, two teams play against each other, one as a home team and the other as the away or guest team. Not all pairs may play against each other, while others may play multiple times. As a mathematically prototypical (though inaccurate) sub-case one may consider all pairs playing exactly once, which gives the observations an implicit matrix structure (row = home team, column = away team). Outcome labels and features crucially depend on the teams constituting the pairing.

(iii) Pairings take place over time, and the expected outcomes are plausibly expected to change with (possibly hidden) characteristics of the teams. Hence we will model the temporal dependence explicitly, to be able to take it into account when building and checking predictive strategies.
2.1.1. The Generative Model. Following the above discussion, we will fix a generative model as follows: as in the standard supervised learning setting, we will consider a generative joint random variable $(X, Y)$ taking values in $X \times Y$, where $X$ is the set of features (or covariates, independent variables) for each pairing, while $Y$ is the set of labels (or outcome variables, dependent variables).

In our setting, we will consider only the cases $Y = \{\text{win}, \text{lose}\}$ and $Y = \{\text{win}, \text{lose}, \text{draw}\}$, in which case an observation from $Y$ is a so-called match outcome, as well as the case $Y = \mathbb{N}^2$, in which case an observation is a so-called final score (in which case, by convention, the first component of $Y$ is that of the home team), or the case of score differences where $Y = \mathbb{Z}$ (in which case, by convention, a positive number is in favour of the home team). From the official rule set of a game (such as football), the match outcome is uniquely determined by a score or score difference. As all the above sets $Y$ are discrete, predicting $Y$ will amount to supervised classification (the score difference problem may be phrased as a regression problem, but we will abstain from doing so for technical reasons that become apparent later).

The random variable $X$ and its domain $X$ shall include information on the teams playing, as well as on the time of the match.

We will suppose there is a set $I$ of teams, and for $i, j \in I$ we will denote by $(X_{ij}, Y_{ij})$ the random variable $(X, Y)$ conditioned on the knowledge that $i$ is the home team, and $j$ is the away team. Note that information in $X_{ij}$ can include any knowledge on either single team $i$ or $j$, but also information corresponding uniquely to the pairing $(i, j)$.

We will assume that there are $Q := \# I$ teams, which means that the $X_{ij}$ and $Y_{ij}$ may be arranged in $(Q \times Q)$ matrices each.

Further, there will be a set $T$ of time points at which matches are observed. For $t \in T$ we will denote by $(X(t), Y(t))$ or $(X_{ij}(t), Y_{ij}(t))$ an additional conditioning that the outcome is observed at time point $t$.

Note that the indexing $X_{ij}(t)$ and $Y_{ij}(t)$ formally amounts to a double conditioning and could be written as $X \mid I = i, J = j, T = t$ and $Y \mid I = i, J = j, T = t$, where $I, J, T$ are random variables denoting the home team, the away team, and the time of the pairing. Though we do believe that the index/bracket notation is easier to carry through and to follow (including an explicit mirroring of the "matrix structure") than the conditional or "graphical models" type notation, which is our main reason for adopting the former and not the latter.
2.1.2. The Observation Model. By construction, the generative random variable $(X, Y)$ contains all information on having any pairing playing at any time. However, observations in practice will concern two teams playing at a certain time, hence observations in practice will only include independent samples of $(X_{ij}(t), Y_{ij}(t))$ for some $i, j \in I$, $t \in T$, and never full observations of $(X, Y)$, which can be interpreted as a latent variable.

Note that the observations can be, in principle, correlated (or unconditionally dependent) if the pairing $(i, j)$ or the time $t$ is not made explicit (by the conditioning which is implicit in the indices $i, j, t$).

An important aspect of our observation model will be that whenever a value of $X_{ij}(t)$ or $Y_{ij}(t)$ is observed, it will always come together with the information of the playing teams $(i, j) \in I^2$ and the time $t \in T$ at which it was observed. This fact will be implicitly made use of in the description of algorithms and validation methodology. (Formally, this could be achieved by explicitly exhibiting/adding $I \times I \times T$ as a Cartesian factor of the sampling domains $X$ or $Y$, which we will not do for reasons of clarity and readability.)
Two independent batches of data will be observed in the exposition. We will consider:

a training set $D := \{(X^{(1)}_{i_1 j_1}(t_1), Y^{(1)}_{i_1 j_1}(t_1)), \dots, (X^{(N)}_{i_N j_N}(t_N), Y^{(N)}_{i_N j_N}(t_N))\}$,

a test set $T := \{(X^{(1*)}_{i^*_1 j^*_1}(t^*_1), Y^{(1*)}_{i^*_1 j^*_1}(t^*_1)), \dots, (X^{(M*)}_{i^*_M j^*_M}(t^*_M), Y^{(M*)}_{i^*_M j^*_M}(t^*_M))\}$,

where the $(X^{(i)}, Y^{(i)})$ and $(X^{(i*)}, Y^{(i*)})$ are i.i.d. samples from $(X, Y)$.

Note that unfortunately (from a notational perspective), one cannot omit the superscripts $\kappa$ as in $X^{(\kappa)}$ when defining the samples, since the figurative "dice" should be cast anew for each pairing taking place. In particular, if all games consisted of a single pair of teams playing, with results independent of time, they would all be the same (and not only identically distributed) without the super-index, i.e., without distinguishing different games as different samples from $(X, Y)$.
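As a concrete (purely illustrative) representation of this observation model, each sample can be stored as a record that always carries the pairing and the time alongside features and label, matching the convention that $(i, j, t)$ is always observed together with $(X_{ij}(t), Y_{ij}(t))$. All field and class names below are our own choice, not the paper's:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Observation:
    """One sampled pairing outcome: (X_ij(t), Y_ij(t)) together with (i, j, t)."""
    home: str  # team i
    away: str  # team j
    time: int  # time point t, e.g. a match day index
    x: Any     # features X_ij(t); None if no feature information is used
    y: str     # label Y_ij(t), e.g. "win", "draw", "lose" (home team's perspective)

# A toy training batch D: the same pairing may occur several times, which is
# why samples carry an explicit position in the list (the superscript kappa)
# and are not identified by (i, j, t) alone.
D = [
    Observation("Arsenal", "Chelsea", time=1, x=None, y="win"),
    Observation("Chelsea", "Arsenal", time=2, x=None, y="draw"),
    Observation("Arsenal", "Chelsea", time=3, x=None, y="win"),
]
```

Freezing the dataclass makes each record an immutable sample, mirroring the view of observations as fixed draws from the generative variable.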
2.1.3. The Learning Task. As set out in the beginning, the main task we will be concerned with is predicting future outcomes given past outcomes and features, observed from the process above. In this work, the features will be assumed to change slowly over time. It is not our primary goal to identify the hidden features in $(X, Y)$, as they are never observed and hence not accessible as a ground truth which can validate our models. Rather, these will be of secondary interest and considered empirically validated by a well-predicting model.

More precisely, we will describe methodology for learning and validating predictive models of the type

$f : X \times I \times I \times T \to \operatorname{Distr}(Y)$,

where $\operatorname{Distr}(Y)$ is the set of (discrete probability) distributions on $Y$. That is, given a pairing $(i, j)$ and a time point $t$ at which the teams $i$ and $j$ play, and information of type $x = X_{ij}(t)$, make a probabilistic prediction $f(x, i, j, t)$ of the outcome.

Most algorithms we discuss will not use added information in $X$, hence will be of type $f : I \times I \times T \to \operatorname{Distr}(Y)$. Some will disregard the time in $T$. Indeed, the latter algorithms are to be considered scientific baselines above which any algorithm using information in $X$ and/or $T$ has to improve.

The models $f$ above will be learnt on a training set $D$, and validated on an independent test set $T$ as defined above. In this scenario, $f$ will be a random variable which may implicitly depend on $D$ but will be independent of $T$. The learning strategy - that is, how $f$ depends on $D$ - may take any form and is considered in a full black-box sense. In the exposition, it will in fact take the form of various parametric and non-parametric prediction algorithms.

The goodness of such an $f$ will be evaluated by a loss $L : \operatorname{Distr}(Y) \times Y \to \mathbb{R}$ which compares a probabilistic prediction to the true observation. The best $f$ will have a small expected generalization loss

$\varepsilon(f \mid i, j, t) := \mathbb{E}_{(X,Y)}\left[ L\left( f(X_{ij}(t), i, j, t), Y_{ij}(t) \right) \right]$
at any future time point $t$ and for any pairing $i, j$. Under mild assumptions, we will argue below that this quantity is estimable from $T$ and only mildly dependent on $t, i, j$.

However, a good form for $L$ is not a priori clear. Also, it is unclear under which assumptions $\varepsilon(f \mid t)$ is estimable, due to the conditioning on $(i, j, t)$ in the training set. These special aspects of the competitive sports prediction settings will be addressed in the subsequent sections.
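To make the abstract signature $f : X \times I \times I \times T \to \operatorname{Distr}(Y)$ concrete, a predictor can be sketched as any callable returning a probability mass function over $Y$, and the expected generalization loss can be estimated by averaging a loss over an i.i.d. test set. The uniform baseline below, which ignores features, pairing and time, is exactly the kind of scientific baseline mentioned above. This is our own minimal sketch; all names are illustrative:

```python
# Distr(Y) for a discrete Y is represented as a dict mapping outcomes to probabilities.
OUTCOMES = ("win", "draw", "lose")

def uniform_baseline(x, i, j, t):
    """A predictor f(x, i, j, t) -> Distr(Y) that ignores all of its inputs."""
    p = 1.0 / len(OUTCOMES)
    return {y: p for y in OUTCOMES}

def empirical_loss(f, test_set, loss):
    """Estimate the expected generalization loss of f on an i.i.d. test set.

    test_set: iterable of (x, i, j, t, y) tuples;
    loss: a function Distr(Y) x Y -> R, e.g. the log-loss of Section 2.2.
    """
    losses = [loss(f(x, i, j, t), y) for (x, i, j, t, y) in test_set]
    return sum(losses) / len(losses)
```

For instance, under the log-loss `lambda p, y: -log(p[y])`, the uniform baseline incurs exactly $\log 3 \approx 1.0986$ on every ternary sample, a convenient sanity check for any evaluation pipeline.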
2.2. Losses for probabilistic classification

In order to evaluate different models, we need a criterion to measure the goodness of probabilistic predictions. The most common error metric used in supervised classification problems is the prediction accuracy. However, the accuracy is often insensitive to probabilistic predictions.

For example, on a certain test case model A predicts a win probability of 60%, while model B predicts a win probability of 95%. If the actual outcome is not a win, both models are wrong. In terms of prediction accuracy (or any other non-probabilistic metric), they are equally wrong because both of them made one mistake. However, model A should be considered better than model B, since it assigned a higher probability to the "true" outcome.

Similarly, if a large number of outcomes of a fair coin toss have been observed as training data, a model that predicts 50% for both outcomes on any test data point should be considered more accurate than a model that predicts 100% for either outcome 50% of the time.
There exist two commonly used criteria that take into account the probabilistic nature of predictions, which we adopt. The first one is the log-loss or log-likelihood loss (Equation 1 below) and the second is the Brier score (Equation 2 below). Both losses compare a distribution to an observation, hence mathematically have the signature of a function $\operatorname{Distr}(Y) \times Y \to \mathbb{R}$. By (very slight) abuse of notation, we will identify a distribution on (discrete) $Y$ with its probability mass function; for a distribution $p$ and $y \in Y$, we write $p_y$ for the mass on the observation $y$ (= the probability to observe $y$ in a random experiment following $p$).

With this convention, the log-loss $L_\ell$ and the Brier loss $L_{\mathrm{Br}}$ are defined as follows:

$L_\ell : (p, y) \mapsto -\log p_y \qquad (1)$

$L_{\mathrm{Br}} : (p, y) \mapsto (1 - p_y)^2 + \sum_{\tilde{y} \in Y \setminus \{y\}} p_{\tilde{y}}^2 \qquad (2)$
The log-loss and the Brier loss functions have the following properties:

(i) The Brier score is only defined on a $Y$ with addition/subtraction and a norm defined. This is not necessarily the case in our setting, where it may be that $Y = \{\text{win}, \text{lose}, \text{draw}\}$. In the literature, this is often identified with $Y = \{1, 0, -1\}$, though this identification is arbitrary, and the Brier score may change depending on which numbers are used. On the other hand, the log-loss is defined for any $Y$ and remains unchanged under any renaming or renumbering of a discrete $Y$.

(ii) For a joint random variable $(X, Y)$ taking values in $X \times Y$, it can be shown that the expected loss $\mathbb{E}\left[ L_\ell(f(X), Y) \right]$ is minimized by the "correct" prediction $f : x \mapsto \left( p_y = P(Y = y \mid X = x) \right)_{y \in Y}$.
The two loss functions are usually introduced as empirical losses on a test set $T$, i.e.,

$\hat{\varepsilon}_T(f) = \frac{1}{\# T} \sum_{(x, y) \in T} L_*\left(f(x), y\right),$

where $*$ stands for $\ell$ or $\mathrm{Br}$.
The empirical log-loss is the (negative log-)likelihood of the test predictions.

The empirical Brier loss, usually called the "Brier score", is a straightforward translation of the mean squared error used in regression problems to the classification setting, as the expected mean squared error of predicted confidence scores. However, in certain cases, the Brier score is hard to interpret and may behave in unintuitive ways [27], which may partly be seen as a phenomenon caused by the above-mentioned lack of invariance under class re-labelling.

Given this, and the interpretability of the empirical log-loss as a likelihood, we will use the log-loss as the principal evaluation metric in the competitive outcome prediction setting.
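The two losses of Equations (1) and (2) are straightforward to compute; the sketch below (our own code, not from the paper) also illustrates the fair-coin intuition above: the honest 50/50 predictor pays log-loss $\log 2 \approx 0.693$ regardless of the outcome, while a confident predictor pays little when right and heavily when wrong.

```python
from math import log

def log_loss(p, y):
    """Log-loss L_ell(p, y) = -log p_y, for a pmf p given as a dict {outcome: prob}."""
    return -log(p[y])

def brier_loss(p, y):
    """Brier loss L_Br(p, y) = (1 - p_y)^2 + sum of p_{y'}^2 over y' != y."""
    return (1.0 - p[y]) ** 2 + sum(q ** 2 for label, q in p.items() if label != y)

# Honest fair-coin predictor vs. an overconfident one.
coin = {"heads": 0.5, "tails": 0.5}
confident = {"heads": 0.95, "tails": 0.05}
```

For the binary fair-coin prediction, both losses are constant over outcomes: log-loss $\log 2$ and Brier loss $0.5$; the confident predictor's log-loss jumps to $-\log 0.05 \approx 3.0$ whenever "tails" occurs, reflecting property (ii) that honest probabilities minimize the expected log-loss.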
2.3. Learning with structured and sequential data

The dependency of the observed data on pairing and time makes the prediction task at hand non-standard. We outline the major consequences for learning and model validation, as well as the implicit assumptions which allow us to tackle these. We will do this separately for the pairing and the temporal structure, as these behave slightly differently.
2.3.1. Conditioning on the pairing. Match outcomes are observed for given pairings $(i, j)$, that is, each feature-label pair will be of the form $(X_{ij}, Y_{ij})$, where as above the subscripts denote conditioning on the pairing. Multiple pairings may be observed in the training set, but not all; some pairings may never be observed.

This has consequences for both learning and validating models.

For model learning, it needs to be made sure that the pairings to be predicted can be predicted from the pairings observed. In other words, the label $Y^*_{ij}$ in the test set that we want to predict is (in a practically substantial way) dependent on the training set $D = \{(X^{(1)}_{i_1 j_1}, Y^{(1)}_{i_1 j_1}), \dots, (X^{(N)}_{i_N j_N}, Y^{(N)}_{i_N j_N})\}$. Note that smart models will be able to predict the outcome of a pairing even if it has not been observed before, and even if it has, they will use information from other pairings to improve their predictions.
For various parametric models, "predictability" can be related to completability of a data matrix with the $Y_{ij}$ as entries. In Section 4, we will relate Élő type models to low-rank matrix completion algorithms; prediction can be understood as low-rank completion, hence predictability corresponds to completability. Though working out completability exactly is not the primary aim of this manuscript, and for our data of interest, the English Premier League, all pairings are observed in any given year, so completability is not an issue. Hence we refer to [33] for a study of low-rank matrix completability. General parametric models may be treated along similar lines.
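As a toy illustration of predictability-as-completability (our own sketch, anticipating Section 4 rather than reproducing it): if the log-odds matrix has the Élő-type additive structure $M_{ij} = u_i - u_j$, which has rank at most two, then ratings recovered by least squares from any connected set of observed pairings determine every unobserved entry. All names and numbers below are hypothetical:

```python
import numpy as np

# Hypothetical ground truth: 4 teams with ratings u; the Elo-structured
# log-odds matrix M[i, j] = u[i] - u[j] has rank at most 2.
u_true = np.array([1.0, 0.5, -0.2, -1.3])

observed = [(0, 1), (1, 2), (2, 3), (0, 2)]  # pairings whose entries we see

# Least-squares estimation of u: each observed entry gives one linear
# equation u[i] - u[j] = M[i, j].
A = np.zeros((len(observed) + 1, 4))
b = np.zeros(len(observed) + 1)
for row, (i, j) in enumerate(observed):
    A[row, i], A[row, j] = 1.0, -1.0
    b[row] = u_true[i] - u_true[j]
A[-1, :] = 1.0  # gauge fixing: ratings are only identified up to an additive constant
u_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# The never-observed pairing (1, 3) is now "completed" from the fitted ratings:
m_13 = u_hat[1] - u_hat[3]
```

Because the observed pairings connect all four teams, the rating differences are exactly identified; had the observations split the teams into two disconnected groups, cross-group entries would remain uncompletable, which is the matrix-completion face of unpredictability.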
For model-agnostic model validation, it should hold that the expected generalization loss

$\varepsilon(f \mid i, j) := \mathbb{E}_{(X,Y)}\left[ L\left( f(X_{ij}, i, j), Y_{ij} \right) \right]$

can be well-estimated by empirical estimation on the test data. For league-level team sports datasets, this can be achieved by having multiple years of data available: even if not all pairings are observed, usually the set of pairings which is observed is (almost) the same in each year, hence the pairings will be similar in the training and test set if whole years (or half-seasons) are included. Further, we will consider an average over all observed pairings, i.e., we will compute the empirical loss on the test set $T$ as

$\hat{\varepsilon}(f) := \frac{1}{\# T} \sum_{(X_{ij}, Y_{ij}) \in T} L\left( f(X_{ij}, i, j), Y_{ij} \right).$

By the above argument, the set of all observed pairings in any given year is plausibly modelled as similar, hence it is plausible to conclude that this empirical loss estimates some expected generalization loss

$\varepsilon(f) := \mathbb{E}_{X,Y,I,J}\left[ L\left( f(X_{IJ}, I, J), Y_{IJ} \right) \right]$

where $I, J$ (possibly dependent) are random variables that select the teams which are paired.
Note that this type of aggregate evaluation does not exclude the possibility that predictions for single teams (e.g., newcomers or after re-structuring) may be inaccurate, but only that the "average" prediction