A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions

Antonio Toral* (University of Groningen, The Netherlands) — [email protected]
Víctor M. Sánchez-Cartagena (Prompsit Language Engineering, Av. Universitat s/n, Edifici Quorum III, E-03202 Elx, Spain) — [email protected]

* Work partly done at his previous position in Dublin City University, Ireland.

Abstract

We aim to shed light on the strengths and weaknesses of the newly introduced neural machine translation paradigm. To that end, we conduct a multifaceted evaluation in which we compare outputs produced by state-of-the-art neural machine translation and phrase-based machine translation systems for 9 language directions across a number of dimensions. Specifically, we measure the similarity of the outputs, their fluency and amount of reordering, the effect of sentence length and performance across different error categories. We find out that translations produced by neural machine translation systems are considerably different, more fluent and more accurate in terms of word order compared to those produced by phrase-based systems. Neural machine translation systems are also more accurate at producing inflected forms, but they perform poorly when translating very long sentences.

1 Introduction

A new paradigm to statistical machine translation, neural MT (NMT), has emerged very recently and has already surpassed the performance of the mainstream approach in the field, phrase-based MT (PBMT), for a number of language pairs, e.g. (Sennrich et al., 2015; Luong et al., 2015; Costa-Jussà and Fonollosa, 2016; Chung et al., 2016).

In PBMT (Koehn, 2010) different models (translation, reordering, target language, etc.) are trained independently and combined in a log-linear scheme in which each model is assigned a different weight by a tuning algorithm. On the contrary, in NMT all the components are jointly trained to maximise translation quality. NMT systems have a strong generalisation power because they encode translation units as numeric vectors that represent concepts, whereas in PBMT translation units are encoded as strings. Moreover, NMT systems are able to model long-distance phenomena thanks to the use of recurrent neural networks, e.g. long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (Chung et al., 2014).

The translations produced by NMT systems have been evaluated thus far mostly in terms of overall performance scores, be it by means of automatic or human evaluations. This has been the case of last year's news translation shared task at the First Conference on Machine Translation (WMT16).[1] In this translation task, outputs produced by participant MT systems, the vast majority of which fall under either the phrase-based or neural approaches, were evaluated (i) automatically with the BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) metrics, and (ii) manually by means of ranking translations (Federmann, 2012) and monolingual semantic similarity (Graham et al., 2016). In all these evaluations, the performance of each system is measured by means of an overall score, which, while giving an indication of the general performance of a given system, does not provide any additional information.

In order to understand better the new NMT paradigm and in what respects it provides better (or worse) translation quality than state-of-the-art PBMT, Bentivogli et al. (2016) conducted a detailed analysis for the English-to-German language direction. In a nutshell, they found out that NMT (i) decreases post-editing effort, (ii) degrades faster than PBMT with sentence length and (iii) results in a notable improvement regarding reordering.

[1] http://www.statmt.org/wmt16/translation-task.html
In this paper we delve further in this direction by conducting a multilingual and multifaceted evaluation in order to find answers to the following research questions: whether, in comparison to PBMT, NMT systems result in:

• considerably different output and a higher degree of variability;
• more or less fluent output;
• more or less monotone translations;
• translations with better or worse word order;
• better or worse translations depending on sentence length;
• fewer or more errors for different error categories: inflectional, reordering and lexical.

Hereunder we specify the main differences and similarities between this work and that of Bentivogli et al. (2016):

• Language directions. They considered 1 while our study comprises 9.
• Content. They dealt with transcribed speeches while we work with news stories. Previous research has shown that these two types of content pose different challenges for MT (Ruiz and Federico, 2014).
• Size of evaluation data. Their test set had 600 sentences while our test sets span from 1999 to 3000 sentences depending on the language direction.
• Reference type. Their references were both independent from the MT output and also post-edited, while we have access only to single independent references.
• Analyses. While some analyses overlap, some are novel in our experiments, namely output similarity, fluency and the degree of reordering performed.

Our analyses are conducted on the best PBMT and NMT systems submitted to the WMT16 translation task for each language direction. This (i) guarantees the reproducibility of our results, as all the MT outputs are publicly available, (ii) ensures that the systems evaluated are state-of-the-art, as they are the result of the latest developments at top MT research groups worldwide, and (iii) allows the conclusions that will be drawn to be rather general, as 6 languages from 4 different families (Germanic, Slavic, Romance and Finno-Ugric) are covered in the experiments.

The rest of the paper is organised as follows. Section 2 describes the experimental setup. Subsequent sections cover the experiments carried out in which we measured different aspects of NMT, namely: output similarity (Section 3), fluency (Section 4), degree of reordering and quality of word order (Section 5), sentence length (Section 6), and amount of errors for different error categories (Section 7). Finally, Section 8 holds the conclusions and proposals for future work.

2 Experimental Setup

The experiments are run on the best[2] PBMT[3] and NMT constrained systems submitted to the news translation task of WMT16. Out of the 12 language directions at the translation task, we conduct experiments on 9.[4] These are the language pairs between English (EN) and Czech (CS), German (DE), Finnish (FI), Romanian (RO) and Russian (RU) in both directions (except for Finnish, where only the EN→FI direction is covered, as no NMT system was submitted for the opposite direction, FI→EN). Finally, there was an additional language at the shared task, Turkish, which is not considered here, as either none of the systems submitted was neural (Turkish→EN), or there was one such system but its performance was extremely low (EN→Turkish) and hence most probably not representative of the state of the art in NMT.

[2] According to the human evaluation (Bojar et al., 2016, Sec. 3.4). When there are no statistically significant differences between two or more NMT or PBMT systems (i.e. they belong to the same equivalence class), we pick the one with the highest BLEU score. If two NMT or PBMT systems are tied as the best according to BLEU, we pick the one with the best TER score.
[3] Many of the PBMT systems contain neural features, mainly in the form of language models. If the best PBMT submission contains any neural features, we use it as the PBMT system in our analyses as long as none of these features is a fully-fledged NMT system. This was the case of the best submission in terms of BLEU for RU→EN (Junczys-Dowmunt et al., 2016).
[4] Some experiments are run on a subset of these languages due to the lack of required tools for some of the languages involved.
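The selection rule described in footnote 2 can be made concrete with a small sketch. This is only an illustration under stated assumptions: the record fields (human_rank, bleu, ter) and the system names are hypothetical stand-ins for the WMT16 human-evaluation equivalence class and the two automatic scores.

```python
def pick_best_system(systems):
    """Pick one submission per paradigm following footnote 2.

    systems: list of dicts with hypothetical fields
      'name', 'human_rank' (human-evaluation equivalence class, lower is better),
      'bleu' (higher is better) and 'ter' (lower is better).
    """
    # Sort by human-evaluation class first, then break ties by BLEU, then by TER.
    return min(systems, key=lambda s: (s["human_rank"], -s["bleu"], s["ter"]))["name"]

# Example with made-up scores for two systems in the same equivalence class:
print(pick_best_system([
    {"name": "nmt-A", "human_rank": 1, "bleu": 25.9, "ter": 0.61},
    {"name": "nmt-B", "human_rank": 1, "bleu": 25.4, "ter": 0.62},
]))  # -> "nmt-A"
```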
Language pair   MT paradigm         System details
EN→CS           PBMT                Phrase-based, word clusters (Ding et al., 2016)
                NMT                 Unsupervised word segmentation and backtranslated monolingual corpora (Sennrich et al., 2016)
EN→DE           hierarchical PBMT   String-to-tree, neural and dependency language models (Williams et al., 2016)
                NMT                 Same as for EN→CS
EN→FI           PBMT                Phrase-based, rule-based and unsupervised word segmentation, operation sequence model (Durrani et al., 2011), bilingual neural language model (Devlin et al., 2014), re-ranked with a recurrent neural language model (Sánchez-Cartagena and Toral, 2016)
                NMT                 Rule-based word segmentation, backtranslated monolingual corpora (Sánchez-Cartagena and Toral, 2016)
EN→RO           PBMT                Phrase-based, operation sequence model, monolingual and bilingual neural language models (Williams et al., 2016)
                NMT                 Same as for EN→CS
EN→RU           PBMT                Phrase-based, word clusters, bilingual neural language model (Ding et al., 2016)
                NMT                 Same as for EN→CS
CS→EN           PBMT                Same as for EN→CS
                NMT                 Same as for EN→CS
DE→EN           PBMT                Phrase-based, pre-reordering, compound splitting (Williams et al., 2016)
                NMT                 Same as for EN→CS, plus reranking with a right-to-left model
RO→EN           PBMT                Phrase-based, operation sequence model, monolingual neural language model (Williams et al., 2016)
                NMT                 Same as for EN→CS
RU→EN           PBMT                Phrase-based, lemmas in word alignment, sparse features, bilingual neural language model and transliteration (Lo et al., 2016)
                NMT                 Same as for EN→CS

Table 1: Details of the best systems pertaining to the PBMT and NMT paradigms submitted to the WMT16 news translation task for each language direction.
Table 1 shows the main characteristics of the best PBMT and NMT systems submitted to the WMT16 news translation task. It should be noted that all the NMT systems listed in the table fall under the encoder-decoder architecture with attention (Bahdanau et al., 2015) and operate on subword units. Word segmentation is carried out with the help of a lexicon in the EN→FI direction (Sánchez-Cartagena and Toral, 2016) and in an unsupervised way in the remaining directions (Sennrich et al., 2016).

2.1 Overall Evaluation

First, and in order to contextualise our analyses below, we report the BLEU scores achieved by the best NMT and PBMT systems for each language direction at WMT16's news translation task in Table 2.[5] The best NMT system clearly outperforms the best PBMT system for all language directions out of English (relative improvements range from 5.5% for EN→RO to 17.6% for EN→FI) and the human evaluation (Bojar et al., 2016, Sec. 3.4) confirms these results. In the opposite direction, the human evaluation shows that the best NMT system outperforms the best PBMT system for all language directions except when the source language is Russian. This slightly differs from the automatic evaluation, according to which NMT outperforms PBMT for translations from Czech (3.3% relative improvement) and German (9.9%) but underperforms PBMT for translations from Romanian (-3.7%) and Russian (-3.8%).

System    CS     DE     FI     RO     RU
From EN
PBMT      23.7   30.6   15.3   27.4   24.3
NMT       25.9   34.2   18.0   28.9   26.0
Into EN
PBMT      30.4   35.2   23.7   35.4   29.3
NMT       31.4   38.7   -      34.1   28.2

Table 2: BLEU scores of the best NMT and PBMT systems for each language pair at WMT16's news translation task. If the difference between them is statistically significant according to paired bootstrap resampling (Koehn, 2004) with p = 0.05 and 1000 iterations, the highest score is shown in bold.

[5] We report the official results from http://matrix.statmt.org/matrix for the test set newstest2016 using normalised BLEU (column zBLEU-cased-norm).
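The significance test mentioned in the caption of Table 2 can be sketched as follows. This is a minimal illustration of paired bootstrap resampling (Koehn, 2004) under the stated settings (1000 iterations, p = 0.05); the corpus_score() argument is a hypothetical stand-in for the actual metric implementation (BLEU here, chrF1 elsewhere in the paper).

```python
import random

def paired_bootstrap(sys_a, sys_b, refs, corpus_score, n_iter=1000, seed=1):
    """Paired bootstrap resampling (Koehn, 2004).

    sys_a, sys_b: lists of translated sentences from the two systems.
    refs: list of reference translations (same order).
    corpus_score: hypothetical callable (hyps, refs) -> float, e.g. corpus BLEU.
    Returns the fraction of resampled test sets on which system A beats system B.
    """
    random.seed(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_iter):
        # Sample sentence indices with replacement to build a pseudo test set.
        idx = [random.randrange(n) for _ in range(n)]
        score_a = corpus_score([sys_a[i] for i in idx], [refs[i] for i in idx])
        score_b = corpus_score([sys_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_iter

# The difference is deemed significant at p = 0.05 if one system wins
# on at least 95% of the resampled test sets.
```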
3 Output Similarity

The aim of this analysis is to assess to which extent translations produced by NMT systems are different from those produced by PBMT systems. We measure this by taking the outputs of the top n[6] NMT and PBMT systems submitted to each language direction[7] and checking their pairwise overlap in terms of the chrF1 (Popović, 2015) automatic evaluation metric.[8]

We would consider NMT outputs considerably different (with respect to PBMT) if they resemble each other (i.e. high pairwise overlap between NMT outputs) more than they resemble PBMT outputs (i.e. low overlap between an output by NMT and another by PBMT). This analysis is carried out only for language directions out of English, as for all the language directions into English there was, at most, 1 NMT submission.

[6] The number of systems considered is different for each language direction as it depends on the number of systems submitted. Namely, we have considered 2 NMT and 2 PBMT into Czech, 3 NMT and 5 PBMT into German, 2 NMT and 4 PBMT into Finnish, 2 NMT and 4 PBMT into Romanian and 2 NMT and 3 PBMT into Russian.
[7] In order to make sure that all systems considered are truly different (rather than different runs of the same system) we consider only 1 system per paradigm (NMT and PBMT) submitted by each team for each language direction.
[8] Throughout our analyses we use this metric as it has been shown to correlate better with human judgements than the de facto standard automatic metric, BLEU, when the target language is a morphologically rich language such as Finnish, while its correlation is on par with BLEU for languages with simpler morphology such as English (Popović, 2015).
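A sketch of the pairwise-overlap computation follows. It uses a simplified chrF1 (character n-grams up to order 6, equal weight on precision and recall), treating one system's output as the "reference" for the other; we use the original chrF1 implementation of Popović (2015), so the exact scores may differ, and whether each pair is symmetrised is an implementation detail — the sketch averages both directions.

```python
from collections import Counter
from itertools import combinations

def char_ngrams(text, n):
    """Character n-grams of a sentence (whitespace normalised)."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf1(hyp_sents, ref_sents, max_n=6):
    """Simplified corpus-level chrF1: F1 of character n-gram precision and
    recall, averaged over n-gram orders 1..max_n."""
    f_scores = []
    for n in range(1, max_n + 1):
        match = hyp_total = ref_total = 0
        for hyp, ref in zip(hyp_sents, ref_sents):
            h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
            match += sum((h & r).values())
            hyp_total += sum(h.values())
            ref_total += sum(r.values())
        prec = match / hyp_total if hyp_total else 0.0
        rec = match / ref_total if ref_total else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return 100 * sum(f_scores) / len(f_scores)

def average_pairwise_overlap(outputs):
    """Average chrF1 over all pairs of system outputs (each output is a list
    of sentences), in the spirit of the figures reported in Table 3."""
    scores = [(chrf1(outputs[i], outputs[j]) + chrf1(outputs[j], outputs[i])) / 2
              for i, j in combinations(range(len(outputs)), 2)]
    return sum(scores) / len(scores)
```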
TL    2 NMT    2 PBMT    NMT & PBMT
CS    68.66    77.63     64.34
DE    72.10    72.97     66.80
FI    56.03    57.42     55.55
RO    69.47    75.96     68.77
RU    35.52    43.35     29.87

Table 3: Average of the overlaps between pairs of outputs produced by the top n NMT and PBMT systems for each language direction from English to the target language (TL). The higher the value, the larger the overlap.

Table 3 shows the results. We can observe the same trends for all the language directions, namely: (i) the highest overlaps are between pairs of PBMT systems; (ii) next, we have overlaps between NMT systems; (iii) finally, overlaps between PBMT and NMT are the lowest.

We can conclude then that NMT systems lead to considerably different outputs compared to PBMT. The fact that there is higher inter-system variability in NMT than in PBMT (i.e. overlaps between pairs of NMT systems are lower than between pairs of PBMT systems) may surprise the reader, considering that all NMT systems belong to the same paradigm (encoder-decoder with attention) while for some language directions (EN→DE, EN→FI and EN→RO) there are PBMT systems belonging to two different paradigms (pure phrase-based and hierarchical). However, the higher variability among NMT translations can be attributed, we believe, to the fact that NMT systems use numeric vectors that represent concepts instead of strings as translation units.
4 Fluency

In this experiment we aim to find out whether the outputs produced by NMT systems are more or less fluent than those produced by PBMT systems. To that end, we take the perplexity of the MT outputs on neural language models (LMs) as a proxy for fluency. The LMs are built using TheanoLM (Enarvi and Kurimo, 2016). They contain 100 units in the projection layer, 300 units in the LSTM layer, and 300 units in the tanh layer, following the setup described by Enarvi and Kurimo (2016, Sec. 3.2). The training algorithm is Adagrad (Duchi et al., 2011) and we used 1000 word classes obtained with mkcls from the training corpus. The vocabulary is limited to the 50000 most frequent tokens.

LMs are trained on a random sample of 4 million sentences selected from the News Crawl 2015 monolingual corpora, available for all the languages considered.[9]

[9] http://data.statmt.org/wmt16/translation-task/training-monolingual-news-crawl.tgz
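As a minimal illustration of this fluency proxy, the sketch below computes corpus-level perplexity from per-token log-probabilities. The logprob_tokens() function is a hypothetical stand-in for querying the trained neural LM (TheanoLM in our setup); it is assumed to return the natural-log probability of every token in the sentence under the model.

```python
import math

def corpus_perplexity(sentences, logprob_tokens):
    """Perplexity of a set of MT output sentences under a language model.

    sentences: list of tokenised sentences (lists of tokens).
    logprob_tokens: hypothetical callable returning the natural-log
                    probability of each token in a sentence under the LM
                    (here: the neural LM trained on News Crawl 2015).
    """
    total_logprob = 0.0
    total_tokens = 0
    for tokens in sentences:
        logprobs = logprob_tokens(tokens)
        total_logprob += sum(logprobs)
        total_tokens += len(logprobs)
    # Perplexity is the exponential of the negative average log-probability.
    return math.exp(-total_logprob / total_tokens)

# Relative difference between NMT and PBMT perplexities, as reported in Table 4:
# rel_diff = (ppl_nmt - ppl_pbmt) / ppl_pbmt * 100
```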
Language direction    PBMT      NMT       Rel. diff.
EN→CS                 202.91    173.33    −14.58%
EN→DE                 131.54    107.08    −18.60%
EN→FI                 214.10    222.40    3.88%
EN→RO                 124.66    116.33    −6.68%
EN→RU                 158.18    127.83    −19.19%
CS→EN                 110.08    102.36    −7.01%
DE→EN                 122.26    104.72    −14.35%
RO→EN                 106.08    102.18    −3.68%
RU→EN                 123.86    106.75    −13.81%
Average               143.74    129.22    −10.45%

Table 4: Perplexity scores for the outputs of the best NMT and PBMT systems on language models built on 4 million sentences randomly selected from the News Crawl 2015 corpora.

Table 4 shows the results. For all the language directions considered but one, perplexity is higher on the PBMT output compared to the NMT output. The only exception is translation into Finnish, in which perplexity on the PBMT output is slightly lower, probably because its fluency was improved by reranking it with a neural LM similar to the one we use in this experiment (Sánchez-Cartagena and Toral, 2016). The average relative difference, i.e. considering all language directions, is notable at −10.45%. Thus, our experiment shows that the outputs produced by NMT systems are, in general, more fluent than those produced by PBMT systems.

One may argue that the perplexity obtained for NMT outputs is lower than that for PBMT outputs because the LMs we used to measure perplexity follow the same model as the decoder of the NMT architecture (Bahdanau et al., 2015), and hence perplexity on a neural LM is not a valid proxy for fluency. However, the following facts support our strategy:

• The manual evaluation of fluency carried out at the WMT16 shared translation task (Bojar et al., 2016, Sec. 3.5) already confirmed that NMT systems consistently produce more fluent translations than PBMT systems. That manual evaluation only covered language directions into English. In this experiment, we extend that conclusion to language directions out of English.

• Neural LMs consistently outperform n-gram based LMs when assessing the fluency of real text (Kim et al., 2016; Enarvi and Kurimo, 2016). Thus, we have used the most accurate automatic tool available to measure fluency.

5 Reordering

In this section we measure the amount of reordering performed by PBMT and NMT systems. Our objective is to empirically determine whether (i) the recurrent neural networks in NMT systems produce more changes in the word order of a sentence than a PBMT decoder; and whether (ii) these neural networks make the word order of the translations closer to that of the reference.
In order to measure the amount of reordering, we used Kendall's tau distance between word alignments obtained from pairs of sentences (Birch, 2011, Sec. 5.3.2). As the distance needs to be computed from permutations,[10] we turned word alignments into permutations by means of the algorithm defined by Birch (2011, Sec. 5.2).

For each language direction, we computed word alignments between the source-language side of the test set and the target-language reference, the PBMT output and the NMT output by means of MGIZA++ (Gao and Vogel, 2008). As the test sets are rather small for word alignment (1999 to 3000 sentence pairs depending on the language pair), we append bigger parallel corpora to help ensure accurate word alignments and avoid data sparseness. For languages for which in-domain (news) parallel training data is available (German and Russian), we append that dataset (News Commentary). For the remaining languages (Finnish and Romanian) we use the whole Europarl corpus.

The amount of reordering performed by each system can be estimated as the distance between the word alignments produced by that system and a monotone word alignment. The similarity between the reorderings produced by each MT system and the reorderings in the reference translation can also be estimated as the distance between the corresponding word alignments.

[10] A permutation between a source-language sentence and a target-language sentence is defined as the set of operations that need to be carried out over the words in the source-language sentence to reflect the order of the words in the target-language sentence (Birch, 2011, Sec. 5.2).
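The sketch below illustrates the distance computation on a single sentence, assuming the word alignment has already been turned into a permutation (one aligned source position per target word); the handling of unaligned and multiply-aligned words follows Birch (2011, Sec. 5.2) and is not reproduced here. Table 5 reports values where larger means more similar; the sketch uses one minus the normalised distance for that purpose, while the exact normalisation of Birch (2011) may differ.

```python
from itertools import combinations

def kendall_tau_distance(perm_a, perm_b):
    """Normalised Kendall's tau distance between two permutations of the same
    elements: the fraction of element pairs whose relative order differs."""
    assert sorted(perm_a) == sorted(perm_b)
    n = len(perm_a)
    pos_b = {elem: i for i, elem in enumerate(perm_b)}
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if pos_b[perm_a[i]] > pos_b[perm_a[j]]  # ordered differently in perm_b
    )
    return discordant / (n * (n - 1) / 2)

# Amount of reordering of one translation: distance between its
# alignment-derived permutation and the monotone permutation.
permutation = [0, 2, 1, 3, 4]          # toy permutation derived from a word alignment
monotone = sorted(permutation)         # [0, 1, 2, 3, 4]
print(1 - kendall_tau_distance(permutation, monotone))  # similarity-style value
```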
                      Monotone vs.                    PBMT vs. Ref.    NMT vs. Ref.
Language direction    PBMT      NMT       Ref.
EN→CS                 0.9273    0.9029    0.8295      0.8008           0.7964
EN→DE                 0.8006    0.8229    0.7740      0.7560           0.7791
EN→FI                 0.8912    0.9172    0.8367      0.7611           0.7819
EN→RO                 0.8389    0.8378    0.7937      0.8312           0.8282
EN→RU                 0.9342    0.9114    0.8364      0.8249           0.8240
CS→EN                 0.7694    0.7589    0.7128      0.8000           0.8015
DE→EN                 0.8036    0.7830    0.7409      0.7728           0.7943
RO→EN                 0.8693    0.8427    0.8013      0.8187           0.8245
RU→EN                 0.8170    0.7891    0.7247      0.7958           0.8069

Table 5: Average Kendall's tau distance between the word alignments obtained after translating the test set with each MT system being evaluated and a monotone alignment (left); and average Kendall's tau distance between the word alignments obtained for each MT system's translation and the word alignments of the reference translation (right). Larger values represent more similar alignments. If the difference between the distances depicted in the two last columns is statistically significant according to paired bootstrap resampling (Koehn, 2004) with p = 0.05 and 1000 iterations, the largest distance is shown in bold.

Table 5 shows the value of these distances for the language pairs included in our evaluation. For each system, we report the distance proposed by Birch (2011) averaged over all the sentences in the test set.

It can be observed that the amount of reordering introduced by both types of MT systems is lower than the quantity of reordering in the reference translation. NMT generally produces more changes in the structure of the sentence than PBMT. This is the case for all language pairs but two (EN→DE and EN→FI). A possible explanation for these two exceptions is the following: in the former language pair, the PBMT system is hierarchical (Williams et al., 2016), while in the latter, the output was reranked with neural LMs.

Concerning the similarity between the reorderings produced by both MT systems and those in the reference translation, out of 9 directions, in 5 directions the NMT system performs a reordering closer to the reference, in 1 direction the PBMT system performs a reordering closer to the reference, and in the remaining 3 directions the differences are not statistically significant. That is, NMT generally produces reorderings which are closer to the reference translation. The exceptions to this trend, however, do not exactly correspond to the language pairs for which NMT underperformed PBMT.

In summary, NMT systems achieve, in general, a higher degree of reordering than pure, phrase-based PBMT systems, and, overall, this reordering results in translations whose word order is closer to that of the reference translation.

6 Sentence Length

In this experiment we aim to find out whether the performance of NMT and PBMT is somehow sensitive to sentence length. In this regard, Bentivogli et al. (2016) found that, for transcribed speeches, NMT outperformed PBMT regardless of sentence length, while also noting that NMT's performance degraded faster than PBMT's as sentence length increases. It should be noted, however, that sentences in our content type, news, are considerably longer than sentences in transcribed speeches.[11] Hence, the current experiment will determine to what extent the findings on transcribed speeches stand also for texts made of longer sentences.

[11] According to Ruiz et al. (2014), sentences of transcribed speeches in English average 19 words while sentences in news average 24 words.

We split the source side of the test set in subsets of different lengths: 1 to 5 words (1-5), 6 to 10 and so forth up to 46 to 50, and finally longer than 50 words (>50). We then evaluate the outputs of the top PBMT and NMT submissions for those subsets with the chrF1 evaluation metric.

[Figure 1 (line plot): chrF1 (y-axis) against sentence length range (x-axis, 1-5 to >50), one line each for PBMT and NMT.]
Figure 1: NMT and PBMT chrF1 scores on subsets of different sentence length for the language direction EN→FI.

Figure 1 presents the results for the language direction EN→FI. We can observe that NMT outperforms PBMT up to sentences of length 36-40, while for longer sentences PBMT outperforms NMT, with PBMT's performance remaining fairly stable while NMT's clearly decreases with sentence length. The results for the other language directions exhibit similar trends.

[Figure 2 (line plot): relative improvement of NMT over PBMT on chrF1 (y-axis) against sentence length range (x-axis, 1-5 to >50).]
Figure 2: Relative improvement of the best NMT versus the best PBMT submission on chrF1 for different sentence lengths, averaged over all the language pairs considered.

Figure 2 shows the relative improvements of NMT over PBMT for each sentence length subset, averaged over all the 9 language directions considered. We observe a clear trend of this value decreasing with sentence length, and in fact we found a strong negative Pearson correlation (-0.79) between sentence length and the relative improvement (chrF1) of the best NMT over the best PBMT system.

The correlations for each language direction are shown in Table 6. We observe negative correlations for all the language directions except for DE→EN.

Direction    CS      DE      FI      RO      RU
From EN      -0.72   -0.26   -0.89   -0.01   -0.74
Into EN      -0.19   0.10    -       -0.36   -0.70

Table 6: Pearson correlations between sentence length and relative improvement (chrF1) of the best NMT over the best PBMT system for each language pair.
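The per-length analysis can be reproduced along the lines of the sketch below. It is a sketch under stated assumptions: it reuses a hypothetical corpus-level chrF1 scorer (e.g. the one sketched in Section 3), buckets sentences by source length exactly as described above, and computes the Pearson correlation with numpy; the x-value used for each bucket (its lower bound, 55 for >50) is an assumption made for illustration.

```python
import numpy as np

def length_bucket(n_words):
    """Map a source-sentence length to a bucket label: 1-5, 6-10, ..., 46-50, >50."""
    if n_words > 50:
        return ">50"
    lo = ((n_words - 1) // 5) * 5 + 1
    return f"{lo}-{lo + 4}"

def relative_improvement_by_length(src, nmt, pbmt, refs, corpus_score):
    """chrF1-based relative improvement of NMT over PBMT per length bucket.
    corpus_score is the (hypothetical) corpus-level chrF1 scorer."""
    buckets = {}
    for s, n, p, r in zip(src, nmt, pbmt, refs):
        buckets.setdefault(length_bucket(len(s.split())), []).append((n, p, r))
    result = {}
    for label, items in buckets.items():
        n_out, p_out, r_out = zip(*items)
        score_nmt = corpus_score(list(n_out), list(r_out))
        score_pbmt = corpus_score(list(p_out), list(r_out))
        result[label] = (score_nmt - score_pbmt) / score_pbmt
    return result

def pearson_length_vs_improvement(improvements):
    """Pearson correlation between sentence length and relative improvement,
    using the lower bound of each bucket as its length value (55 for >50)."""
    lengths = [55 if k == ">50" else int(k.split("-")[0]) for k in improvements]
    values = [improvements[k] for k in improvements]
    return np.corrcoef(lengths, values)[0, 1]
```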
7 Error Categories

In this experiment we assess the performance of NMT versus PBMT systems on a set of error categories that correspond to five word-level error classes: inflection errors, reordering errors, missing words, extra words and incorrect lexical choices. These errors are detected automatically using the edit distance, word error rate (WER), and precision-based and recall-based position-independent error rates (hPER and rPER, respectively) as implemented in Hjerson (Popović, 2011). These error classes are defined as follows:

• Inflection error (hINFer). A word whose full form is marked as a hPER error while its base form matches the base form in the reference.

• Reordering error (hRer). A word that matches the reference but is marked as a WER error.

• Missing word (MISer). A word that occurs as a deletion error in WER, is also a rPER error and does not share its base form with any hypothesis error.

• Extra word (EXTer). A word that occurs as an insertion error in WER, is also a hPER error and does not share its base form with any reference error.

• Lexical choice error (hLEXer). A word that belongs neither to inflectional errors nor to missing or extra words.

Because it is difficult to disambiguate between three of these categories, namely missing words, extra words and lexical choice errors (Popović and Ney, 2011), we group them into a single category, which we refer to as lexical errors.

As input, the tool requires the full forms and base forms of the reference translations and MT outputs. For base forms, we use stems for practical reasons. These are produced with the Snowball stemmer from NLTK[12] for all languages except for Czech, which is not supported. For this language we used the aggressive variant of czech_stemmer.[13]

[12] http://www.nltk.org
[13] http://research.variancia.com/czech_stemmer/
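As a simplified illustration of the first two criteria above, the sketch below labels inflection and reordering errors given per-word WER and hPER decisions. In practice these decisions come from Hjerson's own alignment, so the inputs wer_errors and hper_errors are assumed here rather than computed; stemming uses NLTK's Snowball stemmer, as in our setup.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")  # e.g. for EN->DE outputs

def classify_inflection_and_reordering(hyp_words, ref_words, wer_errors, hper_errors):
    """Simplified word-level classification in the spirit of Hjerson.

    hyp_words: tokens of the MT output sentence.
    ref_words: tokens of the reference sentence.
    wer_errors: set of hypothesis positions marked as WER (edit-distance) errors.
    hper_errors: set of hypothesis positions marked as hPER errors
                 (both assumed to be provided by the error-detection step).
    """
    ref_stems = {stemmer.stem(w) for w in ref_words}
    ref_forms = set(ref_words)
    labels = {}
    for i, word in enumerate(hyp_words):
        if i in hper_errors and stemmer.stem(word) in ref_stems:
            # Full form is an hPER error but its stem occurs in the reference.
            labels[i] = "inflection"
        elif i in wer_errors and word in ref_forms:
            # The word itself matches the reference but is in the wrong position.
            labels[i] = "reordering"
    return labels
```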
Error type    EN→CS      EN→DE      EN→FI      EN→RO      EN→RU      Average
Inflection    −16.18%    −13.26%    −11.65%    −15.13%    −16.79%    −14.60%
Reordering    −7.97%     −21.92%    −12.12%    −15.91%    −6.18%     −12.82%
Lexical       −0.44%     −3.48%     −1.09%     2.17%      −0.09%     −0.59%

Table 7: Relative improvement of NMT versus PBMT for 3 error categories, for language directions out of English.

Error type    CS→EN      DE→EN      RO→EN      RU→EN      Average
Inflection    −4.38%     −2.47%     −3.65%     −21.12%    −7.91%
Reordering    −8.68%     −21.09%    −9.48%     −8.50%     −11.94%
Lexical       −1.92%     −4.91%     −3.90%     5.32%      −1.35%

Table 8: Relative improvement of NMT versus PBMT for 3 error categories, for language directions into English.

Tables 7 and 8 show the results for language directions out of English and into English, respectively. For all language directions, we observe that NMT results in a notable decrease of both inflection errors (−14.6% on average for language directions out of EN and −7.91% for language directions into EN) and reordering errors (−12.82% from EN and −11.94% into EN). The reduction of reordering errors is compatible with the results of the experiment presented in Section 5.[14]

Differences in performance for the remaining error category, lexical errors, are much smaller. In addition, the results for that category show a mixed picture in terms of which paradigm is better, which makes it difficult to derive conclusions that apply regardless of the language pair. Out of English, NMT results in slightly fewer errors (0.59% decrease on average) for all target languages except for RO (2.17% increase). Similarly, in the opposite language direction, NMT also results in slightly better performance overall (1.35% error reduction on average), and looking at individual language directions NMT outperforms PBMT for all of them except RU→EN.

[14] Although the results depicted both in this section and in Section 5 show that NMT performs better reordering in general, results for particular language pairs are not exactly the same in both sections. This is due to the fact that the quality of the reordering is computed in different ways. In this section, only those words that match the reference are considered when identifying reordering errors, while in Section 5 all the words in the sentence are taken into account. That said, in Section 5 the precision of the results depends on the quality of word alignment.
8 Conclusions

We have conducted a multifaceted evaluation to compare NMT versus PBMT outputs across a number of dimensions for 9 language directions. Our aim has been to shed more light on the strengths and weaknesses of the newly introduced NMT paradigm, and to check whether, and to what extent, these generalise to different families of source and target languages. Hereunder we summarise our findings:

• The outputs of NMT systems are considerably different compared to those of PBMT systems. In addition, there is higher inter-system variability in NMT, i.e. outputs by pairs of NMT systems are more different between them than outputs by pairs of PBMT systems.

• NMT outputs are more fluent. We have corroborated the results of the manual evaluation of fluency at WMT16, which was conducted only for language directions into English, and we have shown evidence that this finding is true also for directions out of English.

• NMT systems introduce more changes in word order than pure PBMT systems, but less than hierarchical PBMT systems.[15] Nevertheless, for most language pairs, including those for which the best PBMT system is hierarchical, NMT's reorderings are closer to the reorderings in the reference than those of PBMT. This corroborates the findings on reordering by Bentivogli et al. (2016).

• We have found negative correlations between sentence length and the improvement brought by NMT over PBMT for the majority of the languages examined. While for most sentence lengths NMT outperforms PBMT, for very long sentences PBMT outperforms NMT. The latter was not the case in the work by Bentivogli et al. (2016). We believe the reason behind this different finding is twofold. Firstly, the average sentence length in their evaluation dataset was considerably shorter; and secondly, the NMT systems included in our evaluation operate on subword units, which increases the effective sentence length they have to deal with.

• NMT performs better in terms of inflection and reordering consistently across all language directions. We thus confirm that the findings of Bentivogli et al. (2016) regarding these two error types apply to a wide range of language directions. Differences regarding lexical errors are much smaller and inconsistent across language directions; for 7 of them NMT outperforms PBMT while for the remaining 2 the opposite is true.

The results for some of the evaluations, especially error categories (Section 7), have been analysed only superficially, looking at what conclusions can be derived that apply regardless of language direction. Nevertheless, all our data is publicly released,[16] so we encourage interested parties to use this resource to conduct deeper language-specific studies.

[15] The latter finding applies only to one language direction, as only for that one the best PBMT system is hierarchical.
[16] https://github.com/antot/neural_vs_phrasebased_smt_eacl17

Acknowledgments

The research leading to these results is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. arXiv preprint arXiv:1608.04631.

Alexandra Birch. 2011. Reordering metrics for statistical machine translation. Ph.D. thesis, The University of Edinburgh.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany, August.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Marta R. Costa-Jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. arXiv preprint arXiv:1603.00810.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, Maryland, June.

Shuoyang Ding, Kevin Duh, Huda Khayrallah, Philipp Koehn, and Matt Post. 2016. The JHU Machine Translation Systems for WMT 2016. In Proceedings of the First Conference on Machine Translation, pages 272–280, Berlin, Germany, August.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A Joint Sequence Translation Model with Integrated Reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA, June.

Seppo Enarvi and Mikko Kurimo. 2016. TheanoLM – An Extensible Toolkit for Neural Network Language Modeling. In Proceedings of the 17th Annual Conference of the International Speech Communication Association.

Christian Federmann. 2012. Appraise: An open-source toolkit for manual evaluation of machine translation output. The Prague Bulletin of Mathematical Linguistics, 98:25–35, September.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP '08, pages 49–57, Columbus, Ohio.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, FirstView:1–28.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016. The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT. In Proceedings of the First Conference on Machine Translation, pages 319–325, Berlin, Germany, August.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749, Phoenix, Arizona, USA, February.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 4, pages 388–395, Barcelona, Spain.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Chi-kiu Lo, Colin Cherry, George Foster, Darlene Stewart, Rabib Islam, Anna Kazantseva, and Roland Kuhn. 2016. NRC Russian-English Machine Translation System for WMT 2016. In Proceedings of the First Conference on Machine Translation, pages 326–332, Berlin, Germany, August.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July.

Maja Popović and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4):657–688, December.

Maja Popović. 2011. Hjerson: An open source tool for automatic error classification of machine translation output. The Prague Bulletin of Mathematical Linguistics, 96:59–67.