Brigham Young University
BYU ScholarsArchive
Theses and Dissertations
2015-12-01
Facilitating Corpus Annotation by Improving Annotation Aggregation
Paul L. Felt
Brigham Young University - Provo
BYU ScholarsArchive Citation
Felt, Paul L., "Facilitating Corpus Annotation by Improving Annotation Aggregation" (2015). Theses and
Dissertations. 5678.
https://scholarsarchive.byu.edu/etd/5678
Facilitating Corpus Annotation by Improving
Annotation Aggregation

Paul L. Felt

A dissertation submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Eric K. Ringger, Chair
Kevin Seppi
Christophe Giraud-Carrier
Deryle W. Lonsdale
Quinn Snell

Department of Computer Science
Brigham Young University
December 2015

Copyright © 2015 Paul L. Felt
All Rights Reserved
ABSTRACT

Facilitating Corpus Annotation by Improving
Annotation Aggregation

Paul L. Felt
Department of Computer Science, BYU
Doctor of Philosophy
Annotated text corpora facilitate the linguistic investigation of language as well as the automation of natural language processing (NLP) tasks. NLP tasks include problems such as spam email detection, grammatical analysis, and identifying mentions of people, places, and events in text. However, constructing high-quality annotated corpora can be expensive. Cost can be reduced by employing low-cost internet workers in a practice known as crowdsourcing, but the resulting annotations are often inaccurate, decreasing the usefulness of a corpus. This inaccuracy is typically mitigated by collecting multiple redundant judgments and aggregating them (e.g., via majority vote) to produce high-quality consensus answers.
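
To make the aggregation step concrete, the following is a minimal sketch of majority-vote label aggregation in Python. The function name, the data layout, and the tie-breaking behavior are illustrative assumptions for this sketch, not details drawn from the dissertation itself.

    from collections import Counter

    def majority_vote(annotations):
        """Collapse redundant crowd judgments into one consensus label per item.

        `annotations` (hypothetical layout) maps each item id to the list of
        labels assigned by different annotators. Ties are broken arbitrarily
        by Counter.most_common.
        """
        return {item: Counter(labels).most_common(1)[0][0]
                for item, labels in annotations.items()}

    # Example: three annotators label two documents.
    votes = {
        "doc1": ["spam", "spam", "ham"],
        "doc2": ["ham", "ham", "spam"],
    }
    print(majority_vote(votes))  # {'doc1': 'spam', 'doc2': 'ham'}

Majority vote treats every annotator as equally reliable; the models developed in this dissertation relax exactly that assumption.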
We improve the quality of consensus labels inferred from imperfect annotations in a number of ways. We show that transfer learning can be used to derive benefit from outdated annotations which would typically be discarded. We show that, contrary to popular preference, annotation aggregation models that take a generative data modeling approach tend to outperform those that take a conditional approach. We leverage this insight to develop CSLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks. When data does not permit generative data modeling, we identify a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks. Finally, we identify a family of models capable of aggregating annotation data containing heterogeneous annotation types such as label frequencies and labeled features. We present a multiannotator active learning algorithm for this model family that jointly selects an annotator, data items, and annotation type.
Keywords: crowdsourcing, corpus annotation, semantic embeddings, LDA, rich prior knowledge
ACKNOWLEDGMENTS

First and foremost, thank you, Stephanie. You have been tirelessly supportive from beginning to end. You have been a sounding board for ideas, an emotional coach, a copy editor, and a friend through the long process of completing a dissertation. In many ways an acknowledgment seems like an insufficient way to describe your role; the reality is closer to joint authorship. Thanks also go to my children, Jane, Nathaniel, Gabriel, and Mackay, who have grudgingly allowed me to work many late nights, but never without reminding me that time is precious. Similar thanks go to my parents, Doug and Shelley, as well as my parents-in-law, Scott and Jane. Without their constant support this work would not have been possible.
My advisor, Dr. Eric Ringger, has been instrumental in helping me develop the mental
tools necessary to do this work. His remarkable attention to detail and rigor of thought were what
originally attracted me to the field of natural language processing. I have benefited greatly from
his passion for exploring the unknown and expanding the intellectual, geographical, and culinary
horizons of himself and those around him. His role in this work cannot be overstated. Dr. Kevin
Seppi has been similarly influential, consistently willing to drop whatever he was doing in order
to discuss a new idea.
I would also like to thank Dr. Jordan Boyd-Graber for his close collaboration on the ideas
that went into Chapter 5 related to the CSLDA model. My committee also deserves a good deal of
thanks for repeatedly providing valuable ideas and feedback, giving perspective, and helping
my work stay focused in useful directions. Also, many thanks go to fellow students who have,
surprisingly, been among my most effective teachers; in particular, Robbie Haertel, Dan Walker,
Kevin Black, and Jeff Lund.
Thanks go to the Fulton Supercomputing Lab for providing the computational resources
supporting a number of the experiments in this work.
Finally, this work was partly supported by the collaborative NSF Grants IIS-1409739 (BYU) and IIS-1409287 (UMD). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.
Table of Contents

1 Introduction
  1.1 Overview of Corpus Annotation
  1.2 Opportunities to Improve Corpus Annotation
  1.3 Thesis Statement
  1.4 Dissertation Organization

2 Using Transfer Learning to Assist Exploratory Corpus Annotation
  2.1 Exploratory Corpus Annotation (ECA)
  2.2 Previous Work
    2.2.1 Transfer Learning
  2.3 ECA as Transfer Learning
    2.3.1 Baselines: TGTTRAIN and ALLTRAIN
    2.3.2 STACK
    2.3.3 AUGMENT
  2.4 Experiments
  2.5 Conclusions and Future Work

3 MOMRESP: A Bayesian Model for Multi-Annotator Document Labeling
  3.1 Introduction
  3.2 Previous Work
  3.3 Methods
    3.3.1 Model
    3.3.2 Inference
    3.3.3 Class Correspondence Correction
    3.3.4 Loss Functions
  3.4 Experiments
    3.4.1 Class Correspondence Correction
    3.4.2 Inferred Label Accuracy
    3.4.3 Annotator Error Estimation
    3.4.4 Failure Cases
  3.5 Conclusions and Future Work

4 Early Gains Matter: A Case for Preferring Generative over Discriminative Crowdsourcing Models
  4.1 Introduction
  4.2 Previous Work
  4.3 Models
    4.3.1 Log-linear data model (LOGRESP)
    4.3.2 Multinomial data model (MOMRESP)
    4.3.3 A Generative-Discriminative Pair
  4.4 Mean-field Variational Inference (MF)
    4.4.1 LOGRESP Inference
    4.4.2 MOMRESP Inference
    4.4.3 Model priors and implementation details
  4.5 Experiments with Simulated Annotators
    4.5.1 Simulating Annotators
    4.5.2 Datasets and Features
    4.5.3 Validating Mean-field Variational Inference
    4.5.4 Discriminative (LOGRESP) versus Generative (MOMRESP)
  4.6 Experiments with Human Annotators
  4.7 Conclusions and Future Work

5 Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA
  5.1 Modeling Annotators and Abilities
  5.2 Latent Representations that Reflect Labels and Confusion
    5.2.1 Leveraging Data
    5.2.2 Confused Supervised LDA (CSLDA)
    5.2.3 Stochastic EM
    5.2.4 Hyperparameter Optimization
  5.3 Experiments
    5.3.1 Human-generated Annotations
    5.3.2 Synthetic Annotations
    5.3.3 Joint vs. Pipeline Inference
    5.3.4 Error Analysis
  5.4 Additional Related Work
  5.5 Conclusion and Future Work

6 Semantic Annotation Aggregation with Conditional Crowdsourcing Models and Word Embeddings
  6.1 Introduction
  6.2 Background
    6.2.1 Data-aware annotation models
    6.2.2 Word and Document Representations
  6.3 Experiments
    6.3.1 Datasets
    6.3.2 Comparison with lexical methods
    6.3.3 When lexical methods do not apply
    6.3.4 Summary of experiments
  6.4 Sentiment dataset error analysis
  6.5 Additional Related Work
  6.6 Conclusions and Future Work

7 Learning from Measurements in Crowdsourcing Models: Inferring Ground Truth from Diverse Annotation Types
  7.1 Introduction
  7.2 Background on Measurements
  7.3 Multi-annotator Measurements Architecture
  7.4 Per-annotator Normal Measurement Model for Classification
    7.4.1 Implementation Considerations
  7.5 Experiments
    7.5.1 Baselines
    7.5.2 Simulated Data
    7.5.3 Sentiment Classification
  7.6 Model Extensions
    7.6.1 Active Measurement Selection
    7.6.2 Labeled Location Measurements
  7.7 Additional Related Work
  7.8 Conclusion and Future Work

8 Conclusions and Future Work

9 Appendix: Supplementary Material for Early Gains Matter: A Case for Preferring Generative over Discriminative Crowdsourcing Models

10 Appendix: Supplementary Material for Learning from Measurements in Crowdsourcing Models
  10.1 Introduction
  10.2 Variational Inference
    10.2.1 Joint Probability
    10.2.2 Mean field updates
    10.2.3 Lower Bound
  10.3 Calculating Expected Values
  10.4 Properties of Expectations

11 Appendix: Deriving the majority vote procedure from an item response model under limiting assumptions

References