Complex Prediction Problems
A novel approach to multiple Structured Output Prediction
Yasemin Altun
Max Planck Institute
ECML HLIE 08
Information Extraction
Extract structured information from unstructured data
Typical subtasks:
Named Entity Recognition: person, location, organization names
Coreference Identification: noun phrases referring to the same object
Relation Extraction: e.g. Person works for Organization
Ultimate tasks:
Document Summarization
Question Answering
Complex Prediction Problems
Complex tasks consisting of multiple structured subtasks
Real-world problems are too complicated to solve at once
Ubiquitous in many domains:
Natural Language Processing
Computational Biology
Computational Vision
!"#$%"$#&’()$"*$"+!),-.#&’/%"/01-
2345*6&7-
Co!m!p"l#e$x"P%r&e"d’i(c)t*io+n,E-x$a./m.0+p%le/1)20+1+34()56+."0%)
!"&+%7/64)!.6$&.$6")56"70&.0+%
8(
!""#!$%&’%’(#’(&()’&*+*’(*,)-*(#’./)#&%(+#%’0,
4(
!&&&&&&!1!!1!1!1!!1!1!1!1--------!----!-!1!1!1!1!!11!1!1!1!&&&&&1!1!!1!1!!)
Motion Tracking in Computational Vision
! 9:S"u6b/ta11s)k&:+I%de’n.6ti/fy0%jo.i’n(t)a*n+gl,es-o$f.h"u6m):a0n’0b+o%dy());7"%.0<4)
=+0%.)/%31"’)+<)>$,/%)?+74
!"#$%#!% &’()*+,-./01, 2
YaseminAltun ComplexPrediction
Complex Prediction Example
3-D protein structure prediction in Computational Biology
Subtask: Identify secondary structure from the amino-acid sequence
2345*6&7-
AAYKSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYHSDTYSGQW
! !"#$"%&"’()*+,-$./.0+%/1)20+1+34()56+."0%)
AAYKSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYHSDTYSGQW
!"&+%7/64)!.6$&.$6")56"70&.0+%
8(
!""#!$%&’%’(#’(&()’&*+*’(*,)-*(#’./)#&%(+#%’0,
4(
!&&&&&&!1!!1!1!1!!1!1!1!1--------!----!-!1!1!1!1!!11!1!1!1!&&&&&1!1!!1!1!!)
! 9:"6/11)&+%’.6/0%.’()*+,-$."6):0’0+%());7"%.Y0as<em4inAl)tun ComplexPrediction
=+0%.)/%31"’)+<)>$,/%)?+74
!"#$%#!% &’()*+,-./01, 2
Standard Approach to Complex Prediction
Pipeline Approach
Define intermediate/sub-tasks
Solve them individually or in a cascaded manner
Use output of subtasks as features (input) for target task
e.g. cascades POS → NER and POS → Chunk, where the second-stage input is x together with the predicted POS tags
[Diagrams: cascaded taggers over input x; figure from Sutton, McCallum and Rohanimanesh showing (a) a linear-chain CRF and (b) a factorial CRF, with links drawn only to observations at the same time step]
Problems (see the sketch below):
Error propagation
No learning across tasks
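To make the pipeline and its two problems concrete, here is a minimal sketch (the feature templates and model objects are hypothetical, not code from the talk): the NER stage consumes the POS stage's predictions, so POS mistakes propagate forward and no training signal flows back.

```python
# Minimal pipeline sketch. `pos_model` and `ner_model` are assumed to be
# pretrained classifiers exposing a `predict(features) -> label` method.

def pos_features(x, t):
    # per-token features for the POS subtask (hypothetical template)
    return {"word": x[t], "suffix": x[t][-3:]}

def ner_features(x, pos_tags, t):
    # NER features include the POS stage's *predicted* tags: a POS mistake
    # is baked into the NER input (error propagation), and NER training
    # never updates the POS model (no learning across tasks).
    feats = {"word": x[t], "pos": pos_tags[t]}
    if t > 0:
        feats["prev_pos"] = pos_tags[t - 1]
    return feats

def pipeline_predict(x, pos_model, ner_model):
    # stage 1: tag the whole sentence with POS
    pos_tags = [pos_model.predict(pos_features(x, t)) for t in range(len(x))]
    # stage 2: NER conditioned on the (possibly wrong) POS output
    return [ner_model.predict(ner_features(x, pos_tags, t)) for t in range(len(x))]
```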
New Approach to Complex Prediction
Proposed approach:
Solve tasks jointly and discriminatively (see the illustrative objective below)
Decompose multiple structured tasks
Use methods from multitask learning
Good predictors are smooth
Restrict the search space to smooth functions of all tasks
Devise targeted approximation methods
Standard approximation algorithms do not capture specifics:
dependencies within tasks are stronger than dependencies across tasks
Advantages:
Less/no error propagation
Enables learning across tasks
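One hedged way to read "solve tasks jointly, discriminatively" as a formula (an illustrative multitask objective, not necessarily the talk's exact formulation): with per-task weights $w_t$, a structured loss $\ell$, and a coupling regularizer $\Omega$,

\[
\min_{w_1,\dots,w_T}\;\sum_{t=1}^{T}\sum_{i=1}^{n}\ell\bigl(x_i,\,y_i^{t};\,w_t\bigr)\;+\;\lambda\,\Omega(w_1,\dots,w_T),
\]

where $\Omega$ restricts the search space to smooth functions of all tasks, e.g. by penalizing disagreement between the task predictors, so that learning in one task informs the others.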
Structured Output (SO) Prediction
Supervised Learning
Given input/output pairs (x, y) ∈ X × Y
Y = {0, ..., m} (classification) or Y = ℝ (regression)
Data from an unknown but fixed distribution D over X × Y
Goal: Learn a mapping h : X → Y
State-of-the-art methods are discriminative, e.g. SVMs, Boosting
In Structured Output prediction:
Multivariate response variable with structural dependencies
|Y| exponential in the number of variables (illustrated below)
Sequences, trees, hierarchical classification, ranking
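A quick illustration of the exponential blow-up for label sequences (toy numbers, assumed only for illustration):

```python
# For a sequence of T positions with m candidate labels each,
# |Y| = m ** T: exhaustive enumeration is hopeless even for short inputs.
m, T = 5, 20
print(m ** T)  # 95367431640625 candidate labelings
```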
SO Prediction
Generative framework: Model P(x, y)
Advantages: Efficient learning and inference algorithms
Disadvantages: Harder problem, questionable independence assumptions, limited representation
Local approaches: e.g. [Roth, 2001]
Advantages: Efficient algorithms
Disadvantages: Ignored/problematic long-range dependencies
Discriminative learning
Advantages: Richer representation via kernels, captures dependencies
Disadvantages: Expensive computation (SO prediction involves iteratively computing marginals or the best labeling during training; see the sketch after this slide)
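To make that last disadvantage concrete: discriminative training of a sequence model repeatedly evaluates quantities such as the log-partition function (needed for marginals). A minimal sketch, assuming linear-chain scores; the score arrays `node` and `trans` are hypothetical inputs, not an API from the talk:

```python
import numpy as np

def log_partition(node, trans):
    # node[t, y]: score of label y at position t; trans[y, y']: transition score.
    # Forward recursion computes log Z(x) in O(T * m^2) instead of O(m^T).
    T, m = node.shape
    alpha = node[0].astype(float)          # log-potentials at t = 0
    for t in range(1, T):
        # logsumexp over the previous label for each current label
        scores = alpha[:, None] + trans + node[t][None, :]
        alpha = np.logaddexp.reduce(scores, axis=0)
    return np.logaddexp.reduce(alpha)      # log Z(x)

# usage with random toy scores: T = 20 positions, m = 5 labels
print(log_partition(np.random.randn(20, 5), np.random.randn(5, 5)))
```

Each training iteration typically calls this (or a Viterbi decode) once per example, which is why discriminative SO learning is computationally expensive.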
Formal Setting
Given S = ((x₁, y₁), ..., (xₙ, yₙ))
Find h : X → Y, h(x) = argmaxᵧ F(x, y)
Linear discriminant function F : X × Y → ℝ
Fᵥᵥ(x, y) = ⟨ψ(x, y), w⟩
Cost function: Δ(y, y′) ≥ 0, e.g. 0-1 loss, Hamming loss
Canonical example: Label sequence learning, where both x and y are sequences (decoding sketched below)
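For label sequence learning, h(x) = argmaxᵧ F(x, y) is computable exactly when F decomposes into per-position and adjacent-pair scores, as in a linear-chain model. A minimal Viterbi sketch under that assumption (the score arrays are hypothetical, as above), plus the Hamming cost Δ from the slide:

```python
import numpy as np

def viterbi(node, trans):
    # node[t, y]: score of label y at position t; trans[y, y']: pair score.
    # Returns the highest-scoring label sequence in O(T * m^2).
    T, m = node.shape
    delta = node[0].astype(float)          # best score ending in each label
    back = np.zeros((T, m), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + trans    # (previous label, current label)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(m)] + node[t]
    y = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):          # follow backpointers
        y.append(int(back[t][y[-1]]))
    return y[::-1]

def hamming(y, y_hat):
    # the slide's Δ(y, y′) for sequences: number of mismatched positions
    return sum(a != b for a, b in zip(y, y_hat))
```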