Criação de Léxicos Bilingues para Tradução Automática Estatística
(Creation of Bilingual Lexicons for Statistical Machine Translation)
Luís Carlos Amado Magalhães Carvalho
Dissertation submitted to obtain the Master's Degree in
Information Systems and Computer Engineering
Jury
Chairperson: Dr. Maria dos Remédios Vaz Pereira Lopes Cravo
Supervisor: Dr. Maria Luísa Torres Ribeiro Marques da Silva Coheur
Co-supervisor: Dr. Isabel Maria Martins Trancoso
Member: Dr. Bruno Emanuel da Graça Martins
November 2010
Acknowledgements
To my beloved girlfriend Ana. Without her, I would never have finished this course. Her precious advice made me carry on toward this important goal in my life.
To my mother, who always tried to pass her serenity on to me. She always cooked my favorite dish when I went to have dinner with her: HER chicken curry :).
To my father in Brazil, who never put extra pressure on me and only worried about my welfare.
To my TV star sister Rita in the Netherlands and to my brother Marco, who always wished me luck on every single exam. I miss her.
To my girlfriend's mother, Rosarinha, who never stopped encouraging me. She is also a great cook, and I'm looking forward to her next meal :).
To her other dear daughter Joana, who never stopped believing in me. She is also a great poker player, mostly because she gives me every chip she earns :).
To my professor Luísa, who was always on top of everything. She is a great professor, fun to be with, very strict about schedules, very demanding, but at the same time easygoing. Since the first interview, my impression was the best and I was not mistaken. She always pointed out the right path to me, and by following it, I managed to be successful. Otherwise, I think I might never have been able to finish this course. I learned a lot from her. Many people told me that I got very lucky with my advisor. I agree. If I were beginning my thesis, I would definitely like to have an advisor like Luísa.
To my good friend Ricardo and his wife Ana, who made this summer, despite the hard work, one of the best. I think the sea in Costa de Caparica misses us even more than we miss it :).
To my friend Luis and his girlfriend Ines, who every single Saturday made me forget how hard it was to accomplish this task by riding on his bike at 240 km/h :). They gave me the bike syndrome, and now I have to get one of those fast babies too :). The sea also misses them a lot.
To my colleague Tiago Luís, who helped me a lot at L2F. If he is not stopped immediately, with his selflessness, he may finish your whole work, and we do not want that :).
To the outstanding performance of my football club SL Benfica in the previous season. It gave me many joyful moments, especially the one with 300 thousand people celebrating the title in Marquês de Pombal.
To James Hetfield and Metallica for performing live for me the last 4 times in Portugal. It is never enough.
To Virgem Suta on the car CD.
To the Colonel who inexplicably failed me in my last flight as an airplane pilot at AFA. I could not be more grateful to him.
Lisboa, November 22, 2010
Luís Carlos Carvalho
To my beloved girlfriend Ana and to my family and friends.
It is better to reign in Hell than to
be slave in Heaven.
Resumo
The research carried out in this work resulted in the development of a framework for detecting cognate words across different languages. The framework is centered on word similarity measures and transliteration rules. Cognate detection was done in two phases: preprocessing and classification. The preprocessing phase used only a subset of the similarity measures in order to discard word pairs that shared no resemblance. The measures were Word Length, Lcsm, Lcsr, Jaro Winkler and Sequence Letters. The resulting pairs were then used for the first classification step: training. This step generated a model based on the similarity measures. This model is used to predict whether words are cognates. Of all the similarity measures, only three are used: Lcsm, Levenshtein and Dice. With these measures, the cognate module reached an F-measure of 66.93%. After the framework was built, it was used to detect translations of named entities. This second module used three named entity recognizers: Stanford NER for names written in English, XIP NER and an adaptive method for names in Portuguese. Two methods were used: the first used Stanford NER with XIP NER, and the second used Stanford NER plus the adaptive method. The first reached an F-measure of 62.65%, while the second method proved more efficient, reaching an F-measure of 73.91%.
Abstract
The research performed in the context of this thesis resulted in the development of a framework for the detection of cognates across texts of different languages. The framework is centered on word similarity measures and transliteration rules. Cognate detection was accomplished in two phases: preprocessing and classification. The preprocessing phase used only a subset of the whole set of similarity measures in order to discard pairs of words that did not share any resemblance. The measures used were Word Length, Lcsm, Lcsr, Jaro Winkler and Sequence Letters. The resulting pairs were then used in the first step of classification: training. Training generated a model based on similarity measures, and this model is then used to predict whether words are cognates. From the whole set of similarity measures, the model used only three: Lcsm, Dice and Levenshtein. With these measures, the cognate module achieved an F-measure of 66.93%. After the framework was built, it was used to detect translations of named entities. This module used three named entity recognizers: Stanford NER for English names, and XIP NER and an Adaptive Method to acquire Portuguese named entities. Two approaches were used: the first paired Stanford NER with XIP NER; the second paired Stanford NER with the Adaptive Method. The first approach achieved an F-measure of 62.65%, while the second performed better, with an F-measure of 73.91%.
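As a rough illustration of the kind of word similarity measures named above, the following Python sketch computes a Levenshtein distance, a Dice coefficient and a longest common subsequence ratio (Lcsr), together with the F-measure used to report results. It relies on the common textbook definitions (Dice over character bigram sets, Lcsr normalized by the longer word), which are assumptions here and may differ from the definitions used in the thesis. Lcsm is not defined in this section, so it is not shown. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word (assumed definition)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def dice(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (assumed definition)."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        return 1.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pair = ("night", "nacht")
    print(levenshtein(*pair), lcsr(*pair), dice(*pair))
```

In a setup like the one described, scores of this kind would be computed for each candidate word pair and passed as features to the trained classifier. How the thesis combines them is not specified in this section.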