Table Of Content

Unified Framework of Feature Based Adaptation for Statistical Speech Synthesis and Recognition

Thesis No. 5612 (2013), presented on 4 April 2013 to the Faculté des Sciences et Techniques de l'Ingénieur, Laboratoire de l'Idiap, doctoral program in electrical engineering, École Polytechnique Fédérale de Lausanne, for the degree of Docteur ès Sciences, by Lakshmi Babu Saheer.

Accepted on the proposal of the jury: Dr J.-M. Vesin, president of the jury; Prof. H. Bourlard, thesis director; A. W. Black, examiner; Dr R. Schlüter, examiner; Prof. J.-Ph. Thiran, examiner.

Switzerland, 2013

Faith is taking the first step even when you don't see the whole staircase.
Martin Luther King, Jr.

To my loving husband...

Acknowledgements

These past four years have been a wonderful experience for me. It has been a tough and equally exciting journey. I could not have completed it successfully without the love and support from a lot of people around me. Firstly, I thank my supervisors Philip N. Garner and John Dines for their dedicated support and guidance. Even though EPFL declined to recognise them officially as my supervisors, they warmly extended their generous help at every step of my PhD. This research would not have been fruitful without their encouragement and guidance. They gave great advice and always showed me the right direction. I really envy the great technical clarity that Phil has for any topic we discussed, and I learned more from him than from any textbooks or research papers. Phil would always be around (any time of the day) and patiently correct every bit of my technical writing, usually in the late evenings, and that too in pubs. John, on the other hand, was a great source of motivation, supporting every activity that I wanted to pursue (even if they sounded stupid). He always managed to find time for reviewing my work in spite of his busy schedule as a start-up CTO. I consider myself extremely lucky to have you both as my supervisors.

I would also like to thank Dr. Junichi Yamagishi from the Centre for Speech Technology Research (CSTR), Edinburgh, for all his guidance and collaboration. He helped me with some important parts of my PhD research and provided most of the data and scripts for my work.
This collaboration was very fruitful as we published a conference paper and are in the process of writing a journal paper. He always came up with a number of ideas for research and even helped me personally in my trips to Japan for conferences. He acted like a co-supervisor to me with his guidance and performed some subjective listening tests for me at CSTR. It was a pleasure to work with my PhD colleague and friend, Hui Liang. We were both working for the Effective Multilingual Interaction in Mobile Environments (EMIME) project and helped and supported each other with healthy discussions and debates. I was happy to have him around to share the work and ideas. We did a lot of research together, mainly during the beginning of the project and the cross-lingual work presented in this thesis. He was a great companion to have, especially during the travels for conferences and project meetings.

It was a great opportunity to work on the EMIME project with the different partners. It was a pleasure to meet all the partners at the project meetings and discuss different technical aspects of the project apart from having fun. I thank Prof. Steve Renals, Dr. Simon King (both from CSTR and University of Edinburgh), Dr. Mikko Kurimo (Aalto University, Finland), Prof. Keiichi Tokuda (Nagoya Institute of Technology, Japan), Dr. Jilei Tian (Nokia, China) and many others. I extend my gratitude to my funding sources, the EMIME (European Union FP7) project and V-FAST (Hasler) project, for the financial support in the course of my doctoral studies.

I would like to thank Idiap management, especially Prof. Hervé Bourlard (my supervisor and director of Idiap), for providing me this great opportunity to work at Idiap and ensuring all the resources needed for my research. Prof. Bourlard was a great source of personal inspiration. I thank the secretaries, Mrs. Nadine Rousseau and Mrs. Sylvie Millius, for all the administrative support. It was not easy to find my way around in Switzerland from the first day of my arrival till date. Special thanks to the deputy director of Idiap, Dr. Francois Foglia, for his support, especially during the International Create Challenge (ICC 2012), and for his confidence in my project. I thank Dr. Milos Cernak for all the help and support, especially for the ICC project.
We are a great team and plan to continue this collaboration as far as we can. The ICC group was a good source of happiness. I also thank the other support staff at Idiap, Frank Formaz, Norbert Crettol, Vincent Spano, Alexandre Nanchen, Ed Gregg, Christophe Ecoeur and several others. Special thanks to my colleagues Flavio Tarsetti and Laurent El-Shafey for the help with French translations. I am lucky to have known Particia Emonet, with her help in French and my personal life.

Special thanks to my friend and colleague, Afsaneh Asaei, for being there to share my happiness and to comfort me in times of distress. We helped each other in our ordeals. Similarly, my friend and colleague, Ramya Rasipuram, for being a great source of comfort. We both delivered our babies around the same time and could easily share our happiness and troubles. We had some great hikes organised by Marco Fornoni, Laurent El-Shafey, and Deepu Vijayasenan. Thanks to the Indian community in Martigny (Samuel, Jamie, Deepu, Ramya, Murali, Venkatesh, Abhilasha, Jagan, Gokul, Sriram, Harsha, Mathew, Dinesh and several others) for keeping my social life alive with great activities like Indian dinner night, barbecues and other get-togethers from time to time. There are a lot of other Idiap colleagues who made my PhD life enjoyable, including Oya, Daiira, Joan, Paco, Serena, Tatiana, Niklas, Gelareh, Elham, Sameera, Jean-marc, Petr, Nicolae, and Aileen, to name a few.

I extend my warm gratitude to all my family and friends. Specially, my father, mother and sister for their encouragement and love. Also, my father-in-law and mother-in-law for their love and support. Finally, I am extremely happy to have a wonderful husband, Saheer, and an adorable daughter, Nora, without whose unconditional love and support this PhD would have been impossible. Although she is just one year old, my daughter co-operated with every single activity I had to take up. My husband is always a source of encouragement, convincing me to take up impossible tasks, boosting my confidence by enlightening me with my potential and helping me in every way he can. I am really fortunate to have him as my life partner. I dedicate this work to him.
The list of people mentioned in this acknowledgement is not exhaustive and I apologise in case I accidentally missed out some of the very special people who have influenced my PhD life.

Martigny, 28 November 2012
Lakshmi Saheer

Abstract

The advent of statistical parametric speech synthesis has paved new ways to a unified framework for hidden Markov model (HMM) based text to speech synthesis (TTS) and automatic speech recognition (ASR). The techniques and advancements made in the field of ASR can now be adopted in the domain of synthesis. Speaker adaptation is a well-advanced topic in the area of ASR, where the adaptation data from a target speaker is used to transform the canonical model parameters to represent a speaker specific model. Feature adaptation techniques like vocal tract length normalization (VTLN) perform the same task by transforming the features; this can be shown to be equivalent to model transformation. The main advantage of VTLN is that it can demonstrate noticeable improvements in performance with very little adaptation data and can be classified as a rapid adaptation technique.

VTLN is a widely used technique in ASR, and can be used in TTS to improve the rapid adaptation performance. In TTS, the task is to synthesize speech that sounds like a particular target speaker. Using VTLN for TTS is found to make the output synthesized speech sound quite similar to the target speaker from his very first utterance. An all-pass filter based bilinear transform was implemented for the mel-generalized cepstral (MGCEP) features of the HMM-based speech synthesis system (HTS). The initial implementation used a grid search approach that selects the best warping factor for the speech spectrum from a grid of available values using the maximum likelihood criterion. VTLN was shown to give performance improvements in the rapid adaptation framework where the number of adaptation sentences from the target speaker was limited. However, this technique involves huge time and space complexities, and rapid adaptation demands an efficient implementation of the VTLN technique.
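The grid search over warping factors can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the HTS implementation: each utterance is reduced to a list of normalized spectral peak frequencies, the model is a single Gaussian over the warped peak position, and the Jacobian term of the feature transform is omitted. The warping function is the phase response of a first-order all-pass filter; the function names and the grid range are hypothetical.

```python
import math


def warp_frequency(omega, alpha):
    """Warp a normalized frequency omega (radians, 0..pi) with the
    bilinear transform of a first-order all-pass filter."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def log_likelihood(peaks, alpha, mean, var):
    """Gaussian log-likelihood of the warped peaks.
    Assumption: the Jacobian of the warping is ignored in this toy."""
    total = 0.0
    for omega in peaks:
        w = warp_frequency(omega, alpha)
        total += -0.5 * (math.log(2.0 * math.pi * var) + (w - mean) ** 2 / var)
    return total


def grid_search_alpha(peaks, mean, var, lo=-0.1, hi=0.1, step=0.01):
    """Return the grid value of alpha with the highest likelihood."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda a: log_likelihood(peaks, a, mean, var))


# Toy usage: the speaker's peaks are the model mean "un-warped" by 0.05,
# so the search should pick a warping factor close to 0.05.
speaker_peaks = [warp_frequency(1.0, -0.05)] * 3
best_alpha = grid_search_alpha(speaker_peaks, mean=1.0, var=0.01)
```

Every grid point requires a full likelihood evaluation over the adaptation data, which is where the time and space cost mentioned in the abstract comes from.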
To this end, an efficient expectation maximization (EM) based VTLN approach was implemented for HTS using Brent's search. Unlike the ASR features, MGCEP does not use a filter bank (in order to facilitate the speech reconstruction) and this provides equivalence to the model transformation for the EM implementation. This allows the estimation of warping factors to be embedded in the HMM training using the same sufficient statistics as in constrained maximum likelihood linear regression (CMLLR). This work addresses a lot of challenges faced in the process of adopting VTLN for synthesis due to the higher dimensionality of the cepstral features used in the TTS models. The main idea was to unify the theory and practice in the implementation of VTLN for both ASR and TTS. Several techniques have been proposed in this thesis in order to find the best feasible warping factor estimation procedure. Estimation of the warping factor using the lower order cepstral features representing the spectral envelope is demonstrated to be the best approach. Different evaluations on standard databases are performed in this work to illustrate the performance improvements and perceptual challenges involved in the VTLN adaptation.
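Replacing the grid with a continuous one-dimensional search can be sketched as below. Two simplifications are assumed and should be read as such: golden-section search stands in for Brent's method (Brent's method combines golden-section steps with parabolic interpolation and usually converges faster), and a sum of squared deviations of warped spectral peaks stands in for the negative EM auxiliary function. All names are hypothetical.

```python
import math


def warp_frequency(omega, alpha):
    """First-order all-pass (bilinear) warping of a normalized frequency."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def toy_objective(alpha, peaks, mean):
    """Toy stand-in for the negative auxiliary function: squared distance
    between the warped peaks and the model mean."""
    return sum((warp_frequency(w, alpha) - mean) ** 2 for w in peaks)


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


# Toy usage: recover a warping factor that is not on any coarse grid.
speaker_peaks = [warp_frequency(1.0, -0.037)] * 4
alpha_hat = golden_section_minimize(
    lambda a: toy_objective(a, speaker_peaks, 1.0), -0.1, 0.1)
```

Compared with a grid, the search needs only a few dozen objective evaluations and is not limited to the grid resolution.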
VTLN has only a single parameter to represent the speaker characteristics and hence has the limitation of not scaling to the performance of other linear transform based adaptation methods with the availability of large amounts of adaptation data. Several techniques are demonstrated in this work to combine model based adaptation like constrained structural maximum a posteriori linear regression (CSMAPLR) with VTLN, one such technique being the use of VTLN as the prior transform at the root node of the tree structure of the CSMAPLR system. Thus, along with rapid adaptation, the performance scales with the availability of more adaptation data. These techniques, although developed for TTS, can also be effectively used in ASR. They were also shown to give improvements in ASR, especially for scenarios like noisy speech conditions. Other improvements to rapid adaptation, including a bias term for VTLN, multiple transform based VTLN using regression classes and a VTLN prior for non-structural MAPLR adaptation, are also proposed. These techniques also demonstrated both ASR and TTS performance improvements. Also, a few special scenarios, specifically cross-lingual speech, cross-gender speech, child speech and noisy speech evaluations, are presented where the rapid adaptation methods presented in this work were shown to be highly beneficial. Most of these methods will be published as extensions to the open-source HTS toolkit.

Keywords: Vocal tract length normalization, Mel-generalized cepstral features, All-pass filter based bilinear transformations, Rapid feature adaptation, HMM-based statistical parametric speech synthesis (HTS), HMM-based automatic speech recognition (ASR), Unified modeling and adaptation of ASR and TTS, Expectation maximization, Model transformations, Constrained structural maximum a posteriori linear regression, Constrained likelihood linear regression.
Résumé

The introduction of statistical parametric methods for speech synthesis has paved the way to a unified framework for text-to-speech synthesis (TTS) and automatic speech recognition (ASR) based on hidden Markov models (HMMs). The techniques and improvements developed for ASR can now be adopted in the domain of synthesis. Speaker adaptation is a highly developed topic in ASR, where adaptation data from a target speaker are used to modify the parameters of an initial generic model in order to obtain a speaker-specific model. Feature adaptation techniques such as vocal tract length normalization (VTLN) perform the same task by modifying the features; it can be shown that this approach is equivalent to a model transformation. The main advantage of VTLN is that significant performance improvements have been observed when little adaptation data is available, and that it can be regarded as a rapid adaptation technique.
VTLN is a very widespread technique for ASR, and it can be used for TTS to enable rapid adaptation. For TTS, the goal is to synthesize speech so that it resembles that of a given target speaker. It was found that using VTLN for TTS leads to synthesized speech that resembles that of the target speaker from the very beginning of the utterance. A bilinear transform based on an all-pass filter was implemented for the mel-generalized cepstral (MGCEP) coefficients used by the HMM-based speech synthesis system (HTS). The initial implementation uses a grid search method that selects the optimal warping factor for the speech spectrum from a grid of available values using the maximum likelihood criterion. It was established that VTLN leads to improved performance in a rapid adaptation setting where the number of sentences for adaptation to the target speaker is limited. However, this technique has considerable time and memory complexity, whereas rapid adaptation requires an efficient implementation of the VTLN technique.
To this end, an efficient VTLN approach based on an expectation-maximization (EM) algorithm was implemented for HTS using Brent's method. Unlike the features used for ASR, MGCEP coefficients do not employ a filter bank (in order to facilitate speech reconstruction), which brings them closer to the model transformation for the EM implementation. This simplifies the estimation of the warping factors, which can be integrated into the HMM training phase using the same sufficient statistics as for constrained maximum likelihood linear regression (CMLLR). This thesis addresses the many challenges encountered in using VTLN for synthesis, mainly owing to the higher dimensionality of the cepstral features used for TTS models. The central idea is to unify theory and practice in the implementation of VTLN for both ASR and TTS. Numerous techniques are proposed in this thesis in order to find the best feasible procedure for estimating the warping factor. It is established that the best approach for this task relies on the use of the lower-order cepstral features. Several evaluations on standard databases were conducted in order to illustrate the performance improvements and the perceptual difficulties encountered with VTLN.
VTLN employs a single parameter to represent the speaker characteristics. Consequently, this technique cannot reach the performance of other adaptation methods based on linear transformations when a large amount of adaptation data is available. Numerous techniques are presented in this thesis to combine model-based adaptation techniques, such as constrained structural maximum a posteriori linear regression (CSMAPLR), with VTLN. One such technique proposes using VTLN as a prior transform at the root node of the tree of a CSMAPLR system. Thus, while also providing rapid adaptation, performance improves as the amount of available adaptation data grows. These techniques, although developed for TTS, can also be employed for ASR. It was found that these methods lead to improvements for ASR, especially under certain conditions such as noisy environments. In addition, other improvements for rapid adaptation are also proposed, such as a bias term for VTLN, a VTLN based on multiple transforms using regression classes, and a VTLN prior for adaptation based on non-structural MAPLR. These techniques led to improvements for both ASR and TTS. Furthermore, a few particular scenarios, such as evaluations using cross-lingual speech, cross-gender speech, child speech or noisy speech, are presented, for which the rapid adaptation methods described in this thesis are extremely beneficial. Finally, most of these methods will be published as extensions to the open-source HTS software.
Keywords: Vocal tract length normalization, mel-generalized cepstral coefficients, all-pass filter based bilinear transformations, rapid feature adaptation, HMM-based statistical parametric speech synthesis, HMM-based automatic speech recognition, unified modeling and adaptation method for ASR, expectation maximization, model transformations, constrained structural maximum a posteriori linear regression, constrained likelihood linear regression.


Unified Framework of Feature Based Adaptation for Statistical Speech Synthesis and Recognition PDF

167 Pages·2012·2.82 MB·English

by Lakshmi Babu Saheer


