QACTIS Enhancements in TREC QA-2006
P. Schone, G. Ciany*, R. Cuttsb , P. McNamee=, J. Mayfield=, Tom Smith=
U.S. Department of Defense
Ft. George G. Meade, MD 20755-6000
* Dragon Development Corporation, Columbia, MD
b Henggeler Computer Consultants, Columbia, MD
= Johns Hopkins Applied Physics Laboratory, Laurel, MD

ABSTRACT

The QACTIS system has been tested in previous years at the TREC Question Answering Evaluations. This paper describes new enhancements to the system specific to TREC-2006, including basic improvements and thresholding experiments, filtered and Internet-supported pseudo-relevance feedback for information retrieval, and emerging statistics-driven question-answering. For contrast, we also compare our TREC-2006 system performance to that of our top systems from TREC-2004 and TREC-2005 applied to this year's data. Lastly, we analyze evaluator-declared unsupportedness of factoids and nugget decisions of "other" questions to understand major negative changes in performance for these categories over last year.

1. INTRODUCTION

QACTIS (pronounced "cactus"), which breaks out to "Question-Answering for Cross-Lingual Text, Image, and Speech," is a research prototype system being developed by the U.S. Department of Defense. The goals and descriptions of this system are given in our past TREC papers (see Schone, et al., 2004, 2005 in [1], [2]). In this paper, though, we provide a self-contained description of modifications that have been made to the system in 2006. There were three major points of study upon which we conducted research this year: (1) basic improvements to the general processing strategy, (2) information retrieval enhancements as a prefilter, and (3) a move toward integration of more purely-statistical question answering. We describe each of these research avenues in some detail. For the sake of demonstrating these improvements, we evaluate our best systems from TREC-2004 and TREC-2005 on this year's evaluation data as a means of comparison. This comparison does not completely re-create all of the nuances of past-year systems, but we believe it provides an appropriate reflection of system performance over time.

After discussion of the system enhancements, we conduct a post-evaluation analysis of the results from the TREC-2006 evaluation. Our system improved slightly this year in terms of factoid and list answering. However, we experienced 10% and 20% relative losses in system performance due respectively to unsupportedness and inexactness -- numbers which are too large to go without notice. The inexactness losses seem high, but hopefully such degradation has been uniformly observed across systems. On the other hand, the unsupportedness is more suspicious. Unlike many other systems, which use the Internet as a pre-mechanism for finding answers and thereby have a chance of recalling the wrong file, our system solely draws its answers from TREC documents and does not mine the Web for answers. We conducted a number of post-evaluation experiments. One of these attempted to make a determination as to whether the unsupported labels were justified. We found through this analysis no systematic biases. Along a similar vein, at TREC-2005, QACTIS's "Other" answerer received the highest score, whereas this year it suffered a 40% degradation in overall score. We conducted a study to understand this degradation. This evaluation also eliminated concerns about potential assessment problems.

2. CY2006 SYSTEM ENHANCEMENTS

In 2006, there were a number of new avenues of research on QACTIS. As mentioned earlier, these fall into three main directions. Specifically, these involved improvements to the base system, information retrieval enhancements for preselection of appropriate documents, and, lastly, a process to move away from symbolic processing to a more statistical system. We discuss each of these in turn.

2.1 General System Improvements

2.1.1. Overcoming Competition-level Holes
One major modification was designed to overcome problems that only arise with the appearance of fresh data -- problems which unfortunately only really occur during competitions. In TREC-2005, we had noticed that there were a number of questions which our parser failed to properly handle; there were other questions for which the system did not know what kind of information it was seeking; there were questions that were so long that our NIL-tagger generated inadvertent false positives; and lastly, there were many QACTIS-provided answers (as much as 20% relative) marked inexact.

The first two of these were readily solved. We ensured that the system always parsed any previously-unprocessed documents and that it handled these properly at runtime. We also attempted to require that any factoid question whose target answer form was unknown would at least return an entity type. Also, we sought to prevent non-responses for other-style questions by requiring the system to revisit the question in a new way if an earlier stage failed to provide a response. These changes were important for yielding a more robust question answerer, and it is possible that as much as 0.5-1% of the factoid score improvement is attributable to these fixes (particularly the parsing fixes).

Perhaps the most dramatic change in handling the problem of the unforeseen was to modify the system's scoring metric to take the length of the question into consideration. Our system attempts to provide a probability-like estimate of the likelihood that some answer is legitimate given the question. All the non-stopwords of the question are important in the question-answering process, so longer questions will naturally have lower estimated probabilities than shorter questions. Nevertheless, it had previously been our policy to use a threshold as a means of deciding whether an answer should be reported as NIL. This meant that long questions were more likely than short questions to erroneously report NIL as their answers. Through a least-squares fit, we determined that the probability scores were approximately proportional to 0.01^(length_in_words) for questions with at least three words. Therefore, we multiplied the scores of such questions by 100^(length_in_words - 3). We conducted an exhaustive search and determined an optimal new NIL threshold (of 10^-12). In our developments, this augmentation to the weight seemed to have only limited effect on score but prevented accidental discarding of legitimate answers.
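To make the rescaling concrete, the following is a minimal sketch (in Python) of the length-normalized NIL decision just described. The function and variable names are illustrative rather than QACTIS internals; the 10^-12 threshold and the 100^(length-3) factor are the values reported above.

    # Sketch of the length-normalized NIL decision described above.
    # Names are illustrative; they are not QACTIS internals.

    NIL_THRESHOLD = 1e-12  # optimal threshold found by exhaustive search

    def adjusted_score(raw_score, question_length_in_words):
        """Compensate for the roughly 0.01**length decay of raw answer scores."""
        if question_length_in_words >= 3:
            return raw_score * 100 ** (question_length_in_words - 3)
        return raw_score

    def report_nil(raw_score, question_length_in_words):
        """Report NIL only if the length-adjusted score falls below the threshold."""
        return adjusted_score(raw_score, question_length_in_words) < NIL_THRESHOLD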
The last challenge was to overcome the large number of answers that QACTIS produced that were being identified as inexact in past years. Our system might identify the appropriate last name of a person that should be reported as an answer, or it might identify the city without the state. We had previously built name and anaphora resolution into our system which we had not been using, and we experimented with various settings of these components to see if we might get some additional gains, but we were unsuccessful. We reasoned that use of a more recently-built content extractor with such resolution embedded could be especially beneficial. BBN was able to generate output for us from their SERIF [3] engine, and we began work to incorporate this information into an exactness filter. Unfortunately, we were not able to make use of this information prior to the evaluation. Ultimately, the only additional resolution that we could incorporate into the system by the time of the competition was to get the base system to augment city names with their corresponding state names when such information was present in the data.

2.1.2. What is the Actual Information Need?

Based on evaluations over past years, we noted that QACTIS produced erroneous answers for questions about court decisions, court cases, ranks, ball teams, scores, campuses, manufacturers, and, in some cases, titles of works of art. These problems were due largely either to underspecificity or to providing a hyponym for a concept rather than a required instance of that concept.

With regard to underspecificity, "teams" provide a great example. If a question were of the form "What team won Super Bowl ...," it is clear to a human that the team being referenced is an NFL football team. If instead "World Cup" replaced "Super Bowl," the team should be a soccer team. Formerly, the system would seek out any kind of team for the appropriate response -- a problem of underspecificity. To avoid this problem, we encoded knowledge into the system to help it better reach the correct level of specificity, particularly with teams. Likewise, in TREC-2005, the system was prepared to identify titles of works of art, but whether or not it could do so was subject to the way the question was posed. We tried to incorporate more generality into its ability to recognize when such a work of art was being requested. We have not removed all problems with this underspecificity, so this will be continued work as we prepare toward TREC-2007.

The hyponym/instance-of issue is likewise a prominent problem in the system. If the system were to see a question "What was the court decision in ..." or "What was the score ...," the system would think that it was looking for some hyponym of "court decision" or "score" rather than a particular verdict (guilty, innocent) or a numeric value, respectively. We implemented a number of specialized functions to tackle rarer-occurring questions such as these and ensure that the appropriate type of information was provided to the user.
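As an illustration of the kind of knowledge that was encoded for the specificity problem, the team disambiguation can be thought of as a small lookup from contextual cues to a narrower answer type. The sketch below is only a hypothetical rendering of that idea; the cue strings and type labels are invented for illustration, and the actual QACTIS encoding is richer.

    # Hypothetical sketch: cue-to-type table for reaching the right level of
    # specificity when a question asks for a "team".
    TEAM_SPECIFICITY_CUES = {
        "super bowl": "NFL football team",
        "world series": "Major League Baseball team",
        "world cup": "national soccer team",
        "stanley cup": "NHL hockey team",
    }

    def refine_answer_type(question, default_type="team"):
        """Narrow a generic 'team' target type using cues found in the question."""
        q = question.lower()
        for cue, specific_type in TEAM_SPECIFICITY_CUES.items():
            if cue in q:
                return specific_type
        return default_type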
2.1.3. Missing Information Needs: Auto-hypernymy

There are related problems in looking for either hyponyms or instances of classes, problems which are due to a lack of world knowledge in some areas.
Systems that use the Internet to provide them potential answers get around the problem of missing world information. Our base system is, for the most part, self-contained, and we do not currently make direct use of the Web. Therefore, we need to ingest resources to support our question-answering. In past years, we have made use of WordNet [4] and a knowledge source we had previously developed called Semantic Forests [5], plus we had targeted specific categories of questions and derived large inventories of potential concepts under those categories through the use of Wikipedia [6]. This year, however, we tried to grow our world knowledge to much more than hundreds of categories and instead tried to (a) ingest much or all of Wikipedia's taxonomic structure, and (b) automatically induce taxonomic structure on the fly.

For the first effort, we downloaded the entire English component of Wikipedia and distilled out all of its lists. Then we developed code that could turn those lists into a structure akin to the taxonomic structure required for ingestion by our system. Time constraints limited our ability to do this in a flawless fashion (and revisiting this issue is certainly in order for the future).
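The list-distillation step can be pictured as converting each Wikipedia "List of X" page into taxonomic entries whose instances fall under the hypernym X. The sketch below assumes a pre-extracted mapping from list titles to member strings and is our illustration of the idea, not the actual ingestion code.

    import re

    # Sketch: turn a Wikipedia "List of ..." page into (instance, hypernym) pairs
    # for a taxonomic dictionary. The input format is an assumption.

    def list_page_to_taxonomy(list_title, members):
        """Derive hypernym entries from a single 'List of X' page."""
        match = re.match(r"list of (.+)", list_title, flags=re.IGNORECASE)
        if not match:
            return []
        hypernym = match.group(1).strip()          # e.g., "Austrian composers"
        return [(m.strip(), hypernym) for m in members if m.strip()]

    # Example (hypothetical page):
    # list_page_to_taxonomy("List of Austrian composers", ["Wolfgang Amadeus Mozart"])
    # -> [("Wolfgang Amadeus Mozart", "Austrian composers")]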
In addition to the use of this Wiki-generated taxonomic structure, we also experimented with hypernym induction as a means of finding still more information. In 2004, Snow, et al. [7] described an interesting method for hypernym induction that was based on supervised machine learning using features derived from dependency parses. Once trained, the learner can be applied to unannotated corpora to identify new hypernym pairs. Snow provided us with his code, and we began investigating this technology and extending its scalability for application to our question-answering problem.

We used these two new datasets to augment the data that we previously had (and which we had been growing by hand throughout the year) to see if these approaches would yield system improvements. We created two variant taxonomic dictionaries of different sizes and plugged them into the existing system. In Table 1, we illustrate the results of these variants as compared with our baseline system, which makes use of largely hand-assembled data. (Note that the scores listed are the number of number-one answers (#1) and the F-score for lists on past TREC sets with question identifiers in the specified ranges. In these evaluations, also, the system judgments are automatic, and the judgments do not count off for unsupportedness nor for temporal incorrectness. Moreover, the evaluation counts as correct any outputs which have exact answers embedded therein but which may not truly be exact. For example, "Coach Harold Solomon" would be scored as correct even if the exact form should only be "Harold Solomon.") The table illustrates a disappointing result: despite this interesting effort, the taxonomy-growing approach as it currently stands yielded slightly negative results for factoid answering, and Variant1 actually yielded significant losses in list performance. Needless to say, we chose not to select these updated data sources for use in our actual evaluation systems. This will be a subject of study for the future.

Table 1: Wikipedia Ingestion

  QA Set        Baseline        Variant1        Variant2
                #1     List     #1     List     #1     List
  201-700       165    --       164    --       --     --
  894-1393      116    --       113    --       --     --
  1394-1893     133    --       133    --       133    --
  1894-2393     154    .165     152    .137     152    .169
  1.1-65.5      72     .197     71     .130     71     .188
  66.1-140.5    121    .163     124    .109     121    .157

2.1.4. Longer Term Attempts

There were two other areas of research on QACTIS which we undertook with intentions of incorporating by the time of the evaluation but which required more effort than expected to make ready on time. These are mentioned only briefly for the sake of completeness. One of these areas dealt with morphological information and the other with multitiered processing.

With morphology, our system attempts to crudely generate directed conflation sets for all of the words of the question and for the documents which hopefully contain the answers. A number of questions have been answered incorrectly due to incorrect morphological information, related often to word sense. We therefore began what turned out to be a significant effort to convert the way the system did morphological processing to one that would also make use of the part of speech in its stemming process. We hope that this information will strengthen the system at a later point even though it is currently not embedded in the processing.

Another effort which we were not able to finish attempted to convert QACTIS's current process from one that fuses all forms of annotation (named entity, parsing, etc.) from a single stream into one where each stream can be accessed independently. This would allow the system to derive an answer from one stream even if another stream would have yielded some other interpretation. This notion has potentially very positive gains, but we are currently at some distance from knowing its long-term benefits.

Although the full-blown morphology revision was not incorporated by the time of the evaluation, we were able to incorporate a weaker effort regarding morphology with regard to list-processing. The QACTIS system has been developed primarily with a focus on factoid answering, but the morphological structure of questions was not well addressed for tackling lists.
We therefore did work on beefing the system up in terms of its list-handling capability, and, in the long run, this effort proved to be quite useful in that our list-answering capability on the whole improved substantially.

2.2 Retrieving Documents

Like many other question-answering systems, QACTIS begins its question-answering phase by first attempting to find documents on the same subject. We had used the Lemur system [8], version 2.2, since TREC13, and have found its results to be satisfactory. In fact, at TREC14, using this system out of the box yielded one of the top IR systems. We experimented with new versions of Lemur, but were not able to get any better results for QA.

Even still, when we look at the results of our question-answerer, we see that it has a less-than-perfect upper bound due to limitations in information retrieval. If we could enhance our ability to identify appropriate documents, we would likely have a higher performance and a higher upper bound on performance. We set out to improve our ability to preselect documents which would hopefully contain the desired answers. We experimented with two approaches. The first of these was an approach which identified key phrases from the question and tried to ensure that the returned documents actually contained those phrases. We will call this process phrase-based filtering of information retrieval. The second process used the Web as a mechanism for pseudo-relevance feedback. We discuss each of these techniques.

2.2.1 Phrase-based IR Filtering

By the time our system competed in the TREC-2005 evaluation, the base system as applied to one of the older TREC collections (the 1894 question series) was getting mean reciprocal ranks of about 40%. Upon examination of the documents returned by the IR component, it was discovered that a large number of irrelevant documents were being returned. One reason for this was that people's names, such as 'Virginia Woolf', were broken into two separate query terms. The resultant documents returned some that contained 'Virginia Woolf' as well as some that related to a 'Bob Woolf' who lived in 'Virginia'. A question pertaining to 'Ozzy Osborne' also returned a document containing a reference to a woman who owned a dog named Ozzy which bit a 'Mrs. Osborne' on the wrist.

Further analysis of the IR set showed that the top 10 documents for each question contained the correct answer 55% of the time; the top 30 documents contained the correct answer 67% of the time. (The numbers were determined by comparing the document list returned by the IR system to the list of correct documents for each question as provided by the TREC competition committee.) The IR system for the 2005 dataset was much better -- the top 10 documents contained the correct answer 70% of the time, while the top 30 documents contained the correct answer 80% of the time.

A pseudo IR set was built for the 1894 series using the answer set provided by TREC -- we referred to this set as a 'perfect IR' set. The number of correct answers provided by the base system when this dataset was used was ~60% -- an absolute gain of 20% just by removing irrelevant documents.

An additional step was added to the overall system that attempted to filter the IR using the named entities present in the question. This list also includes dates, titles, and anything in quotes. This process did provide an increase in scores as long as it was not overly aggressive in filtering out too many documents.

Further attempts were made to include multi-word terms and low-frequency words (words in the question which had a lower frequency of occurrence in the overall corpus) as filter terms, but there was not enough time to adequately analyze the effect. Additional parameters, such as how many of the top 1000 documents we should examine, how many documents we should retain, and how many documents from the original IR we should keep by default, also had to be factored into the result.

By the time of the TREC-2006 evaluation, it was determined that no more than the top 50 documents should be examined. There was no difference in our system in examining 50 or 75 documents; 100 documents degraded the overall system performance. There was also a significant boost in looking at 50 documents as opposed to just 30. Also, because of list questions, it was determined arbitrarily that at least 10 documents should be retained. Since our IR system showed the top 5 documents for each question to be relevant about 60% of the time, we decided to keep the top 5 documents as a matter of course.

We ran our phrase-based filtering on all of the collections at our disposal on the day before the TREC-2006 evaluation. Table 2 illustrates these results (whose scoring follows the paradigm mentioned in Table 1). As can be seen, this approach affords small (2.6% relative) but positive improvements in the overall system performance.

Table 2: Filtered IR DevSet Improvements

  QA Set        Baseline               w/ Filtered IR         Diff in #1s
                #1     Mrr    List     #1     Mrr    List
  201-700       165    .432   --       172    .446   --       +7
  894-1393      116    .351   --       118    .348   --       +2
  1394-1893     133    .393   --       138    .402   --       +5
  1894-2393     154    .431   .165     158    .444   .176     +4
  1.1-65.5      72     .395   .197     71     .402   .197     -1
  66.1-140.5    121    .445   .163     127    .456   .162     +6
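A skeletal version of the phrase-based filter described in this section might look as follows. It keeps retrieved documents that contain the question's key phrases (named entities, dates, titles, quoted strings), always retains a small number of top documents, and backs off to the original ranking if filtering becomes too aggressive. The 50/10/5 limits come from the text; everything else, including the function names, is illustrative.

    # Sketch of the phrase-based IR filter of Section 2.2.1 (illustrative only).
    MAX_EXAMINED = 50   # no more than the top 50 documents are examined
    MIN_RETAINED = 10   # at least 10 documents retained (for list questions)
    ALWAYS_KEEP = 5     # top 5 documents kept as a matter of course

    def phrase_filter(ranked_docs, phrases):
        """ranked_docs: (doc_id, text) pairs in IR order; phrases: key question phrases."""
        examined = ranked_docs[:MAX_EXAMINED]
        kept = [doc_id for doc_id, _ in examined[:ALWAYS_KEEP]]
        for doc_id, text in examined[ALWAYS_KEEP:]:
            if any(p.lower() in text.lower() for p in phrases):
                kept.append(doc_id)
        # back off to the original ranking if the filter removed too much
        for doc_id, _ in examined:
            if len(kept) >= MIN_RETAINED:
                break
            if doc_id not in kept:
                kept.append(doc_id)
        return kept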
2.2.2. Google-Enhanced Information Retrieval

The second approach to prefiltering was a multistep technique that took advantage of the Internet without actually trying to mine answers from it. This is a process which apparently has been used by other TREC-QA participants in past years.

To improve our document retrieval phase, we used Google to select augmentation terms for each question. Each question in the test set was converted to a Google query and submitted to Google using the Google API. The top 80 unique snippets returned for each question were used as the augmentation collection. Given a question, we counted the number of times each snippet word co-occurred with a question word in the snippet sentence. These sums were multiplied by the inverse document frequency of the term; document frequencies were calculated from the AQUAINT collection. The resulting scores were used to rank the co-occurring terms. The top eight terms were selected as augmentation terms and were added to the original query with weight 1/3. The resulting queries were then used for document retrieval.
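A condensed sketch of this expansion step is shown below. It assumes that the snippets have already been fetched and that an IDF table computed from AQUAINT is available; the top-eight selection and the 1/3 query weight follow the description above, while the helper names and tokenization are our own simplifications.

    from collections import Counter

    # Sketch of the Google-snippet query expansion in Section 2.2.2 (illustrative).
    # `question_words`: lowercased content words of the question; `snippets`: the
    # top 80 unique snippets; `idf`: inverse document frequencies from AQUAINT.

    def expansion_terms(question_words, snippets, idf, k=8):
        cooccur = Counter()
        for snippet in snippets:
            for sentence in snippet.split("."):
                words = set(sentence.lower().split())
                overlap = words & question_words
                if overlap:
                    for w in words - question_words:
                        cooccur[w] += len(overlap)   # co-occurrences with question words
        scored = {w: count * idf.get(w, 0.0) for w, count in cooccur.items()}
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        return [w for w, _ in ranked[:k]]            # added to the query with weight 1/3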
In selecting these parameters, we were faced with many parameter choices for the definition of co-occurrence, term weighting, etc. With over a thousand combinations to choose from, it was not practical to run full question-answering tests on each one to select the best. Instead, we used a proxy for question-answering performance: the number of words occurring in question answers that were selected as expansion terms by the method. We mined the answer patterns from past TREC QA tracks for words. Each time a method selected one of these words as an expansion term for the training collection, it was given a point. We used the highest-scoring method in our TREC-2006 run.
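The proxy used to choose among the candidate configurations amounts to a simple overlap count, sketched below under the assumption that the answer-pattern vocabulary from past TREC tracks has already been mined into a set of words. The function name and data layout are illustrative.

    # Sketch of the selection proxy: a configuration earns one point each time it
    # proposes an expansion term that appears in the vocabulary of known answers
    # from past TREC QA tracks (illustrative names, assumed inputs).

    def proxy_score(expansion_terms_per_question, answer_vocabulary):
        points = 0
        for terms in expansion_terms_per_question.values():
            points += sum(1 for t in terms if t.lower() in answer_vocabulary)
        return points

    # The configuration with the highest proxy_score over the training questions
    # was the one used for the TREC-2006 run.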
Table 3: Google-Enhanced Improvements

  QA Set                BL      Google-Enhanced
                                TF      LS      LT      S       S2
  1894-2393    #1       154     156     130     156     152     157
               Mrr      .431    .455    .377    .455    .442    .462
               List     .165    .191    .148    .192    .179    .191
  1.1-65.5     #1       72      76      80      81      76      79
               Mrr      .395    .434    .405    .443    .434
               List     .191    .176    .179    .182    .189    .195
  66.1-140.5   #1       121     126     116     123     122     130
               Mrr      .445    .464    .473    .458    .463    .473
               List     .163    .158    .179    .159    .163    .168

In our developments with this process, we were quite excited about the gains we were seeing with it. We experimented with five different configurations of this process, and one process (S2) yielded particularly successful results. As can be seen in Table 3, on the three most recent years of TREC, as compared to the baseline, the S2 system in preliminary tests yielded a 6.4% relative improvement in factoid performance and a 4.1% relative improvement in list performance.

2.2.3. A Word about Coupled Retrieval

One last experiment we tried was the coupling of these two prefilters. The hope was that if each could give an improvement, then perhaps in combination the improvement would increase. Unfortunately, this was not the case. It turns out that the approaches are somewhat at odds with each other. The phrase-filtering approach attempts to ensure that only documents that contain some or all of the important question phrases are retained, while the Google-assisted approach attempts to look for documents that might have terminology that was not in the original question. If one applies a phrase-based filtering system to documents that have been obtained by the Google-assisted process, the likelihood is that even fewer of the documents than before will actually have the appropriate terminology. We tried several other variations on this theme after the actual TREC submission, but no combination yielded an improvement in overall performance.

2.3. Beginning to Incorporate Statistical QA

In the past few TREC evaluations, there has been an emergence of statistical QA systems which have the property that they learn relations between the posed question, the answer passage, and the actual answer. Then, when a user poses a future question, various answer passages are evaluated using statistical or machine-learning processes to determine how likely they are to contain a needed answer. As a final step, the system must distill out the answer from the existing passage. Statistical learners are particularly appealing in that they hold the potential capability of developing language-independent QA. For such capability, one need only provide question-answer pairs for training.

We began in 2005 to develop a statistical QA system. The infrastructure for this system was in place by the time of the TREC-2006 evaluation, and the system had begun to be taught how to automatically answer a limited number of questions, so we thought we would couple it with the existing system and allow it to answer those few question structures that it was equipped to address. Since this system is new and emerging, we provide a bit more information about the process and ways that we attempted to exploit the process during 2006.

2.3.1. From Document Selection to Passage Selection
The first step of the process of developing a statistical system was to move from mere document selection to some form of passage selection. The first 45 documents reported from the Lemur IR system were screened to identify sentences (and sometimes surrounding sentences) which reflected the information from the important noun words of the question. The first 80 sentences that satisfied the criteria were retained, and the sentence selection process was then terminated. The 45-document and 80-sentence limits were determined to be empirically optimal thresholds.
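The screening pass can be sketched as follows: walk the top-ranked documents in IR order, keep sentences that contain the question's important noun words, and stop at the caps just mentioned. Sentence splitting and noun extraction are deliberately naive here; only the 45/80 limits come from the text.

    # Sketch of the passage-selection screen in Section 2.3.1 (illustrative).
    MAX_DOCS = 45       # documents examined, in IR order
    MAX_SENTENCES = 80  # sentences retained before the screen stops

    def screen_sentences(ranked_doc_texts, question_nouns):
        selected = []
        for doc_text in ranked_doc_texts[:MAX_DOCS]:
            for sentence in doc_text.split("."):
                words = set(sentence.lower().split())
                if words & question_nouns:
                    selected.append(sentence.strip())
                    if len(selected) >= MAX_SENTENCES:
                        return selected
        return selected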
The next goal was to order these sentences by their potential for answering the question. The reordering component was based on support vector machines (SVMs). The reordering effort was treated as a two-way classification problem -- a customary domain for SVMs. The classifier was based on 26-dimensional feature vectors that were drawn from the data. Examples of these features were: (a) reporting 1.0 if the direct object from the question was in the putative answer sentence and 0.25 otherwise; (b) reporting 1.0 if the direct object from the question was missing but a WordNet synonym was in the putative answer sentence, and 0.25 otherwise; (c) reporting the ratio of question words to putative answer sentence words; and so forth. The classifier was then presented with positive examples of feature vectors drawn from actual answer sentences of past-year TRECs, and it was also presented with a comparable number of negative examples drawn from bogus sentences.
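A toy version of the feature extraction and classifier training, using scikit-learn's SVM implementation, is sketched below. Only three features mirroring examples (a)-(c) are shown, whereas the real vectors were 26-dimensional; the synonym list, the labels, and the function names are assumptions for illustration.

    from sklearn.svm import SVC

    # Toy sketch of the SVM sentence reranker of Section 2.3.1 (illustrative only).

    def features(question_words, direct_object, synonyms, sentence_words):
        sent = set(sentence_words)
        f_a = 1.0 if direct_object in sent else 0.25
        f_b = 1.0 if direct_object not in sent and sent & set(synonyms) else 0.25
        f_c = len(set(question_words) & sent) / max(len(sentence_words), 1)  # one reading of (c)
        return [f_a, f_b, f_c]

    def train_reranker(X, y):
        """X: feature vectors; y: 1 for true answer sentences, 0 for bogus ones."""
        model = SVC(probability=True)   # two-way classification, as in the text
        model.fit(X, y)
        return model

    # Candidate sentences are then ordered by model.predict_proba(X_new)[:, 1].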
From these training examples, the system was taught with quite good accuracy to learn the difference between good question-answering sentences and poor ones. In fact, for questions where the initial IR actually captured relevant documents, the percentage of true answer sentences identified by the SVM on a held-out TREC QA collection was 30% in the top-1 sentence, 43% in the top-2 sentences, 59% in the top-5 sentences, 74% in the top-10 sentences, and 80% in the top-15 sentences. It seemed highly likely that this process could afford a dramatic improvement in overall performance.

2.3.2. Pulling out the Answers from the Sentences

The next issue was to extract the answer from the answer sentences. One strategy was to insert these sentences into the existing question-answerer and hope it could distill out the answer. This process yielded a 20% relative degradation in performance, due largely to the fact that the current system requires itself to find all relevant components of questions, whereas the best sentence may have the answer but not all relevant question components (needed for supportedness). Although we will attempt to modify the base system during the remainder of 2006 and 2007 to tackle the problem, there was also a desire to get a fully-statistical QA system.

We have yet to develop a full statistical learning process for finding answers, but we did begin a simple and potentially language-independent process for answering the questions. Using a copy of Wikipedia, we first used named-entity matching and part-of-speech tagging to see if we could draw out an answer directly from the Wiki pages. Barring this, we looked for redundant but normally rare information from the SVM sentences and, if it existed, this information was returned as the answer.
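The "redundant but normally rare" fallback can be read as scoring candidate terms by how often they recur across the retained SVM sentences, discounted by how common they are in general text, and returning the best-scoring candidate. The sketch below is our reconstruction of that idea; the background frequency table and tokenization are assumed inputs.

    from collections import Counter

    # Sketch of the answer-extraction fallback of Section 2.3.2 (illustrative).
    # `background_freq` is an assumed table of general corpus counts per term.

    def redundant_but_rare(svm_sentences, question_words, background_freq):
        counts = Counter()
        for sentence in svm_sentences:
            for w in set(sentence.lower().split()) - question_words:
                counts[w] += 1
        candidates = {w: c / (1.0 + background_freq.get(w, 0))
                      for w, c in counts.items() if c >= 2}   # must be redundant
        if not candidates:
            return None
        return max(candidates, key=candidates.get)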
3. SYSTEM EVALUATIONS

3.1. Description of Results

In TREC-2006, we submitted three runs from among the various configurations at our disposal. All of our runs used the same "Other" processing as in TREC-2005 (except that this system was slightly more robust to failure than last year). Also, in each situation, the system reported the top 20 answers from our factoid system as the "list" response. In terms of factoids, the first of these runs made use of our base engine, but its information retrieval phase was prefiltered using phrase-based filtering as mentioned before. The second system replaced phrase-based filtering with our Google-enhanced information retrieval efforts. The third system was the same as the second, but whenever the statistical system was deemed able to answer the question itself, it would supplant the original answer with its own. Since the statistical system is in its infancy, there were very few answers that it actually supplanted.

The results of these runs are detailed in Table 4. Under "Factoid," the number of correct answers is listed, followed by the triple (unsupported, inexact, temporally-incorrect) and by the fraction of first-place answers. Under the "List" and "Other" headings are the NIST-reported F-scores. The "All" category is the average of the three preceding columns, which represents the official NIST score. To our surprise, none of these variations provided significantly different results in the "All" category. However, it seems clear that factoids were negatively impacted by Google-enhanced IR as compared to phrase-based filtering, and the opposite is true for Lists.

Table 4: TREC 2006 Performance

  Strategy                                   Factoid                List    Other   All
  Phrase-filtered IR + improved QA [#1]      107 (10/20/5) .266     .147    .148    .185
  Google-enhanced IR, improved QA [#2]       95 (14/22/4) .236      .156    .151    .181
  Google-enhanced IR, improved QA,
    some statistical QA [#3]                 96 (14/22/4) .238      .156    .154    .183
3.2. Comparison to Past Years

With these scores seeming to be only marginally better than they were last year, we wanted to determine if there had actually been any true system improvements since last year. We were able to identify our best competition systems from 2004 and 2005 and conduct a small experiment to test for system improvements by applying these past systems to this year's data. Our experiments would solely focus on factoids, and we would not identify questions for which an assessor, had he or she seen the output of the older system, may have judged an answer as correct. Likewise, whereas we had some parsing problems in years past, we would allow the system to directly access the new and updated parses of today (since the broken and/or empty parses have long since been removed). It was our expectation that these two oversights would likely balance each other and provide a fairly accurate comparison of past-year performance to that of the current year. Additionally, since past-year systems were not concerned with temporally inaccurate answers, and since "unsupported" is difficult to truly judge without the input of an assessor, we scored the three systems by allowing what TREC-2006 assessors had declared to be "Right," "Locally Correct," and "Correct but Unsupported." The following table provides the performance comparisons.

Table 5: TREC 2006 vs. Past Years

  TREC Year                 Factoid: #R+L+U / "Correct" Score
  TREC-2006 System          122 / .303
  TREC-2005 System          95 / .236
  TREC-2004 System          53 / .132
  (TREC-2006 w/o Filter)    (116 / .288)

Table 5 shows a gratifying result. From the first three rows we see that there have been reasonable improvements to our system over the course of the past two years.

The last row of Table 5 is merely informational. Since we were only able to submit three runs to TREC, we were not able to determine the impact of our basic system improvements as opposed to those that were coupled with IR improvements. With the incorporation of the last row, we are able to see that the basic system improvements contributed about 21 more right answers and that IR contributed 6 beyond that. We did not run the experiment of enhanced IR without the basic additions to the system, but while we were originally developing the algorithms, this paradigm was tested, and it was typically the case that IR improvements and basic improvements had 1-3 correct answers in common.

3.3. Considering the "Other"s

At TREC-2005, the F=0.248 that our system generated was the maximum score for any of the "Other" question answerers. This year, our F-score dropped by an absolute 10% and our position fell to just above the median. If we had made changes to our "Other" answerer, we would have believed that we had simply put forth a poor effort in making changes. On the other hand, since the only change we made was to add a stage which would thrice ensure against empty answers, we had to seek to understand why the performance would have fallen as it had.

At TREC-2004, our first year of participation, our system received a very high score and only 1/4 of the answers were given zero credit (with half of these due to non-responsiveness of our system). This year, though, half of our answers were given no credit even though our method for "Other"-answering is sort of a "kitchen sink" approach which reports tons of information. We therefore reviewed the first ten of these zero-scored answers to see what would have changed.

Reviewing the "Other" questions is a non-trivial task. The core issue with these is not knowing how the determination is made as to whether something is vital or not. It appeared this year that since there were so many questions asked from each series, the "Other" questions had little information to choose from that was both novel and vital. Even so, there are three situations that arise in giving a system a zero score in such an evaluation: (a) the QA system did not return items that assessors found to be valuable, (b) the QA system did return such items and received no credit, and (c) the QA system produced items that these assessors deemed to be non-vital but that other assessors might have perceived as nuggets. Since the task is subjective, an evaluation of "c" is not a particularly helpful direction to study. Yet we will touch briefly on the first two of these given our in-depth review of the ten zero-scored answers that we studied. (It should be noted for reference, though, that there is some subjectivity which is inconsistent: in 164.8, credit is given for the fact that Judi Dench was awarded the Order of the British Empire, but credit was not given in 153.8 for Alfred Hitchcock's receiving of higher honors as a Knight Commander.)

By far the biggest problem for us with category "a" above was that what were being deemed as nuggets this year were largely less-important pieces of information which surfaced to the top as vitals because more interesting information was posed as questions. If one were to ask a system, "Tell me everything interesting and unique you can about Warren Moon" (Q141.8), one would expect to receive information about his life profile: birth, death (if appropriate), his profession, and other major successes. Since the Q141 series asks about his position on a football team, the college where he played ball, his birth year, his time as a Pro Bowler, his coaches, and the teams on which he played, the remaining relevant information must address his successes and possibly his death.
Since he has not died, only successes remain. Thus, the facts that he had the third all-time highest career passing yards and a 15-year football career are noteworthy. Even so, our IR prefilter rejected one of the documents containing one of these items, and the other item appeared in a 17th-place document ... deeper than our Other processing typically goes.

Further reviewing in the "a" arena, we noted that vital nuggets for 152.7 (Mozart) appeared in our 17th and 60th documents; a vital for 163.8 (Hermitage) was not in our top 100; the sole vital for 164.8 was in 47th place; the two vitals from 175.5 were in 18th place and non-top-100; and so forth. The absence of such information in the higher-ranked IR documents was obviously a leading contributor to our reduced scores. In past years, there were fewer questions asked per series, and many of those questions did not focus so much on key events of people and organizations but were focused more on exercising the QA systems. These facts seem to be the primary reason why our "Other" results would be so drastically degraded.

However, there are a few instances of "b" occurring as well. That is, our system reported vital information that was overlooked. In 163.8, our system reported without credit that the Hermitage was "comparable to the Louvre in Paris or the Metropolitan Museum in New York," which was a specified vital nugget. Such issues are less frequent, though, and they are not unexpected given that our system reports tons of information as answers. Furthermore, if the answers we perused are indicative, the issue of vitals not receiving credit would probably contribute less than .05 absolute to the current F-score.

3.4 Unsupportedness

As mentioned previously, the large number of answers from our system that assessors were tagging as "unsupported" seemed somewhat suspicious to us given that our system does not draw its answers from the Web. We sought to review the answers being proposed by our system and determine what the unsupported issues were.

First, based on the cumulative information provided for all 59 competing system runs, we were able to determine that the average run had 12 answers that were declared to be unsupported. We looked at our highest-scored factoid run and noted that we had 10 apparently unsupported answers. Although 10 was less than the average, we still wanted to understand the issues. We reviewed each answer and found that all the answers were indeed unsupported (and possibly inexact as well). The table below summarizes this information: the question number (QID), the answer our system reported (with its source document), and the reason why the answer was unsupported.

Table 6: Unsupported Answers: Why?

  QID     Our Answer [Document]
          Reason Unsupported
  145.4   february 23, 1999 [NYT19990302.0069]
          Needed a conviction date. The document, dated 3/2/1999, refers to "last week," which is ambiguous.
  154.4   Margot Kidder [APW19981213.1025]
          Needed the most-frequent actress in Superman. The document states that Kidder starred with Reeve, but nothing about "most."
  172.1   Burlington [NYT20000124.0364]
          Needed a city and state of the company's origin. The answer is inexact, too, but this document only says "based in," not originated in.
  182.5   Scotland [APW19990506.0176]
          Needed the country of the Edinburgh Fringe. The document discusses politics in Scotland and a "fringe party" -- a polysemy problem.
  188.1   California [APW19990117.0079]
          Needed the US state with the highest avocado production. California is only mentioned in passing, and nothing mentions production rates.
  189.7   Edinburgh [NYT20000112.0203]
          Needed the city of JK Rowling in 2000. The document states that she lived in Edinburgh in 1993. Unclear if she lived there in 2000.
  190.1   PITTSBURGH [APW19990615.0036]
          Needed the city of HJ Heinz. The byline gives Pittsburgh and discusses company profits, but does not explicitly say its base is there.
  191.3   Germany [XIE19990620.0031]
          Needed the country that won the first four IRFR World Cups. The document mentions "Germany" and "four" but not that Germany won 4 times.
  194.3   six [XIE19960320.0094]
          Needed the number of players at the 1996 World Chess Super Tournament. The document mentions 6 players, but the wrong tournament and game.
  214.6   seven [NYT20000717.0370]
          Needed the number of Miss America pageant judges. The document is on the Miss Texas Scholastic Pageant, which had 7 judges.

4. FUTURE DIRECTIONS

The future of QACTIS still holds multilingual and multimedia question-answering as a primary goal. Yet we anticipate participating in TREC again next year, at least until we have ironed out the wrinkles in our system. Our focus on textual QA for the next year will be to address the issues that were mentioned in this paper but have yet to be completed, such as improvements and exactness filtering using more modern content extractors, better incorporation of hypernyms, and making improvements to our statistical QA system. We also plan to make our baseline system cleaner and more robust.
5. REFERENCES

[1] Schone, P., Ciany, G., McNamee, P., Kulman, A., Bassi, T., "Question Answering with QACTIS at TREC 2004." The 13th Text Retrieval Conference (TREC-2004), Gaithersburg, MD. NIST Special Publication 500-261, 2004.

[2] Schone, P., Ciany, G., Cutts, R., McNamee, P., Mayfield, J., Smith, T., "QACTIS-based Question Answering at TREC-2005." The 14th Text Retrieval Conference (TREC-2005), Gaithersburg, MD, 2005.

[3] BBN's SERIF(TM) engine.

[4] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K.J., "WordNet: An online lexical database." International Journal of Lexicography 3(4):235-244, 1990.

[5] Schone, P., Townsend, J., Olano, C., Crystal, T.H., "Text Retrieval via Semantic Forests." TREC-6, Gaithersburg, MD. NIST Special Publication 500-240, pp. 761-773, 1997.

[6] www.wikipedia.org

[7] Snow, R., Jurafsky, D., and Ng, A.Y., "Learning syntactic patterns for automatic hypernym discovery." NIPS 2004.

[8] The Lemur System. URL: http://www-2.cs.cmu.edu/~lemur