Environmental Noise Embeddings for Robust Speech Recognition

Suyoun Kim, Bhiksha Raj, Ian Lane
Electrical and Computer Engineering, Carnegie Mellon University
[email protected], [email protected], [email protected]

Abstract

We propose a novel deep neural network architecture for speech recognition that explicitly employs knowledge of the background environmental noise within a deep neural network acoustic model. A deep neural network is used to predict the acoustic environment in which the system is being used. The discriminative embedding generated at the bottleneck layer of this network is then concatenated with traditional acoustic features as input to a deep neural network acoustic model. Through a series of experiments on Resource Management, the CHiME-3 task, and Aurora4, we show that the proposed approach significantly improves speech recognition accuracy in noisy and highly reverberant environments, outperforming multi-condition training, noise-aware training, the i-vector framework, and multi-task learning on both in-domain and unseen noise.

Index Terms: robust speech recognition, noise adaptation

1. Introduction

In many speech recognition tasks, despite an increase in the variability of the training data, it is still common to have significant mismatches between the test environment and the training environment, e.g. ambient noise and reverberation. This environmental distortion results in performance degradation for automatic speech recognition (ASR). Various techniques have been introduced to increase robustness in this situation.

Over the years, prior work on improving robustness under environmental distortion has generally fallen into three categories: feature enhancement, transformation, and augmentation with auxiliary information.
Feature enhancement approaches try to attenuate the corrupting noise in the observation and develop more robust feature representations in order to minimize the mismatches between training and test conditions. Many such methods have been proposed to suppress noise: model-based compensation methods such as Vector Taylor Series (VTS) attempt to model the nonlinear environment function and then compensate for the effects of noise [1]; noise-robust feature extraction algorithms based on the differing characteristics of speech and background noise have been developed [2, 3]; and missing-feature approaches attempt to mask or impute the regions of the spectral components that are unreliable because of degradation due to noise [4, 5, 6]. Transformation approaches attempt to transform the feature or model space adaptively according to each speaker or each utterance [7, 8].

One recent approach involves augmenting the acoustic features with auxiliary information that characterizes the testing conditions, such as a noise estimate [9]. This approach attempts to enable the deep neural network (DNN) acoustic model [10, 11, 12] to learn the relationship between noisy speech and noise directly from the data by giving it additional cues. Instead of being given a preprocessed or normalized feature, the network figures out the normalization during training by using its exceptional modeling power. To do so, the network is informed by noise identity features. Noise-Aware Training (NAT), proposed in [9], uses an estimate of the noise as the noise identity feature. In this work we extend the prior work, NAT, with an improved method for modeling and representing dynamic environmental noise.

Related work includes the use of the identity vector (i-vector) representation based on Gaussian Mixture Models (GMMs). The i-vector is a popular technique for speaker verification and speaker recognition, and it captures the acoustic characteristics of a speaker's identity in a low-dimensional fixed-length representation. For this reason, it has been used for speaker adaptation in ASR [13, 14]. However, the i-vector framework has so far only been applied to speaker adaptation, not to noise adaptation. The success of the i-vector framework in speaker adaptation of DNN acoustic models motivated us to look at its applicability to noise adaptation.

In this work, we propose a noise adaptation framework that can dynamically adapt to various testing environments. Our framework incorporates environmental acoustics into the DNN acoustic model to improve robustness to environmental distortion. The model explicitly employs knowledge of the background noise and learns a low-dimensional noise feature from a discriminatively trained DNN, which we call a noise embedding. Through a series of experiments on the Resource Management (RM) [15], CHiME-3 [16], and Aurora4 [17] datasets, we show that our proposed approach improves speech recognition accuracy in various types of noisy environments. In addition, we compare our approach with NAT [9], the i-vector framework [18, 14], and a multi-task learning framework that jointly predicts the noise type and the context-dependent triphone states.

The paper is organized as follows. In Section 2 we review other noise adaptation systems, NAT and the i-vector framework, and describe our proposed noise adaptation framework. In Section 3 we evaluate the performance of the proposed approach. Finally, we draw conclusions and discuss future work in Section 4.

2. Environmental Noise Adaptation

[Figure 1: Illustration of our noise embedding adaptive training (+N DNN) and MTL frameworks. (a) +N DNN sequentially trains two parts of the same network: (1) train the environmental embeddings, then (2) train the triphone network. By contrast, (b) MTL jointly optimizes the two components of the network.]

2.1. Noise Aware Training

One framework that has been used for noise adaptation is Noise-Aware Training (NAT), proposed in [9]. NAT is designed to make the DNN acoustic model automatically learn the relationship between each observed input and the noise present in the signal by augmenting the input with an additional cue, the noise estimate. This noise estimate is simply computed by averaging the first and last ten frames of each utterance. NAT achieves approximately 2% relative improvement in word error rate (WER) on the Aurora4 dataset [17]. However, because NAT assumes the noise is stationary and uses a noise estimate that is fixed over the utterance, the performance of the technique depends on the characteristics of the background noise and on prior knowledge of which frames contain only noise. In this work, we explore a better way to represent the noise to improve adaptation performance.
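As a concrete illustration, the per-utterance NAT estimate can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (the function names and the frames-by-dimensions array layout are illustrative, not code from [9]):

```python
import numpy as np

def nat_noise_estimate(features, n_edge=10):
    """Average the first and last `n_edge` frames of an utterance,
    which NAT assumes to be speech-free, into one noise vector.

    features: (T, D) array of acoustic feature frames.
    """
    edges = np.concatenate([features[:n_edge], features[-n_edge:]], axis=0)
    return edges.mean(axis=0)

def augment_with_noise_estimate(features):
    """Append the same fixed estimate to every frame; this is why NAT
    implicitly assumes the noise is stationary over the utterance."""
    noise = nat_noise_estimate(features)
    return np.concatenate(
        [features, np.tile(noise, (features.shape[0], 1))], axis=1)
```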
2.2. Identity Vector for Noise

The i-vector framework is a popular technique for speaker recognition that captures the acoustic characteristics of a speaker's identity in a low-dimensional fixed-length representation. For this reason, it has been used as a speaker adaptation technique for ASR, where it consistently achieves a 5-6% relative improvement in WER. The success of the i-vector framework in speaker adaptation of DNN acoustic models motivated us to look at its applicability to noise adaptation.

Here we review the main idea behind the i-vector framework. The acoustic feature vectors x_t ∈ R^D are seen as samples generated from a universal background model (UBM), represented as a GMM with K diagonal-covariance Gaussians. The key assumption of the i-vector algorithm is a linear dependence between the speaker-adapted supervector s (with respect to the UBM) and the speaker-independent mean supervector m:

    s = m + Tw    (1)

where T, of size D × M, is the factor loading submatrix corresponding to component k, and w is the M-dimensional speaker identity vector (i-vector) corresponding to the speaker. We estimate the posterior distribution of w given the speaker's data x_t(s) using the EM algorithm: the i-vector extraction transforms are estimated iteratively by alternating between evaluating w in the E-step and updating the model parameter T in the M-step.

In this work, instead of using the speaker ID as in the general application of the i-vector system, we use the noise type to generate noise i-vectors.

2.3. Learning environmental noise embeddings

In this subsection we describe our approach, which explicitly employs knowledge of the background environmental noise within a DNN acoustic model to improve robustness under environmental distortion. Our approach is motivated by the previous work on NAT and extends the way the noise adaptation data is represented. Unlike NAT, our system can dynamically adapt to different testing environments by appending a varying noise estimate at each input frame.

Our proposed system consists of two subnetworks with different objectives. As shown in Figure 1a, the left network, D_noise, learns the noise embeddings, and the right network, D_phoneme, is the regular acoustic model. The networks are optimized sequentially.

First, we learn the noise embeddings at each frame from a narrow bottleneck hidden layer in D_noise, given various types of noisy speech data. We start by training D_noise on the regular acoustic features, X, to classify the ground-truth categorical labels, the noise types Y_N. We use a bottleneck neural network for D_noise: a kind of multi-layer perceptron (MLP) in which one of the internal layers has a small number of hidden units relative to the size of the other layers. The common approach to extracting feature vectors from such a network is to use the activations of the bottleneck hidden units as features [19]. It has been shown that forcing this small layer to create a constriction in the network compresses the features into a low-dimensional representation; the bottleneck can therefore be viewed as a nonlinear transformation that performs dimensionality reduction of the input features. We take advantage of this fact to generate a low-dimensional secondary feature vector. To make the bottleneck feature vector embed the discriminative acoustic characteristics of the background noise instead of the phonetic characteristics, the task of the network is to classify the different noise conditions.

Once D_noise is optimized, we extract the noise embedding X_e at each input frame from the bottleneck hidden layer of D_noise. The learned noise embeddings X_e are then concatenated to each corresponding original acoustic feature frame. Because the noise estimate keeps changing over time, our noise adaptation technique does not require the assumption that the noise is stationary.

Finally, we train D_phoneme on the input features X and X_e to classify the phonetic states Y_P, as in usual acoustic modeling. In the decoding step, the noise label is not required: we obtain the noise embedding by forwarding the acoustic features through the optimized D_noise. Figure 1a illustrates the overall architecture.
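To make the two-stage procedure concrete, here is a minimal PyTorch sketch of D_noise and of how the bottleneck activations are extracted and appended to the acoustic features. The layer sizes follow the +N DNN configuration described in Section 3.2; the sigmoid nonlinearity and all identifiers are our own illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class NoiseBottleneckDNN(nn.Module):
    """D_noise: classifies the noise type per frame; the narrow fourth
    hidden layer yields the 40-dimensional noise embedding X_e."""
    def __init__(self, feat_dim, num_noise_types, bottleneck=40):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, bottleneck), nn.Sigmoid(),  # bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Linear(bottleneck, 1024), nn.Sigmoid(),
            nn.Linear(1024, num_noise_types),           # noise-type logits Y_N
        )

    def forward(self, x):
        emb = self.front(x)           # noise embedding X_e
        return self.back(emb), emb

def augment_features(d_noise, x):
    """Decoding needs no noise label: a forward pass through the trained
    D_noise produces X_e, which is concatenated to the acoustic frame."""
    with torch.no_grad():
        _, emb = d_noise(x)
    return torch.cat([x, emb], dim=-1)  # input to D_phoneme
```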
[Figure 2: A comparison of the final input features of the unseen noise set (the Aurora4 evaluation set [17]) from the different algorithms: baseline, +N NAT, +N GMM, and +N DNN. 700 randomly selected input features are projected into 2-dimensional space by LDA. The 40-dimensional noise features generated from the model trained on the CHiME-3 training set were augmented. The colors represent each type of noise condition.]

2.4. Multi-task learning

Our framework described in Section 2.3 sequentially trains two parts of the same network: first we train the environmental embeddings, and then we fix them and train the triphone network. As a comparator, we also attempt joint optimization, in which the two components of the network are optimized together. This joint optimization is effectively a multi-task learning setup, a method that jointly learns more than one problem at the same time using a shared representation. Multi-task learning has been applied to various speech-related tasks, and our setup, MTL, is similar to these other multi-task learning solutions [20], except that we consider the environment as the variable.

Figure 1b shows the architecture of our MTL approach. We jointly optimize the network to predict the noise label while also predicting the triphone states, so that the network can learn noise-related structure. As the secondary task, the noise label classification task is designed to predict the acoustic environment type Y_N from the current acoustic observation X. For a fair comparison with our framework +N DNN, we build a network of the same size in which two hidden layers are shared across the two tasks. In particular, we give the second shared hidden layer the same dimension as our noise embedding feature, so that this layer can serve as the environmental noise information. Once the network is optimized to minimize both the noise prediction error and the triphone state error, the two shared hidden layers and the three task-specific hidden layers on the triphone side are used for decoding.
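A minimal sketch of this joint setup, under the same illustrative assumptions as the previous sketch, is shown below. In particular, the interpolation weight `lam` between the two cross-entropy losses is our assumption; the paper does not state how the two task losses are combined:

```python
import torch.nn as nn

class SharedMTLNet(nn.Module):
    """Two shared hidden layers feed both a noise-type head and a
    triphone-state head; the second shared layer is kept at the noise
    embedding dimension so it can act as the noise representation."""
    def __init__(self, feat_dim, emb_dim, num_noise_types, num_states):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, emb_dim), nn.Sigmoid(),  # embedding-sized layer
        )
        self.noise_head = nn.Linear(emb_dim, num_noise_types)
        self.triphone_head = nn.Sequential(          # used at decoding time
            nn.Linear(emb_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, num_states),
        )

    def forward(self, x):
        h = self.shared(x)
        return self.noise_head(h), self.triphone_head(h)

ce = nn.CrossEntropyLoss()

def mtl_loss(model, x, y_noise, y_triphone, lam=0.1):
    # lam weights the secondary (noise) task; the value is illustrative.
    noise_logits, triphone_logits = model(x)
    return ce(triphone_logits, y_triphone) + lam * ce(noise_logits, y_noise)
```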
3. Experiments

3.1. Dataset

We investigate the performance of our noise embedding technique on three different databases, RM [15], the CHiME-3 task [16], and Aurora4 [17], in two main ways: an in-domain noise experiment and an unseen noise experiment. In the in-domain noise experiment, we test on the same types of noise on which the model was trained. For the unseen noise test, we train the model on the CHiME-3 dataset and then test it on the evaluation set of the Aurora4 task.

We first evaluated our method on in-domain experiments with noisy data derived from RM. We artificially mixed the clean speech with eight different types of noisy background: white noise at 0 dB and 10 dB SNR, street noise at 0 dB and 10 dB SNR, background music at 0 dB and 10 dB SNR, and simulated reverberation with 1.0 s and 600 ms reverberation times. The street noise and background music segments were obtained from [2], the reverberation simulations were generated using the Room Impulse Response open-source package [21], and the virtual room size was 5 x 4 x 6 meters.

The CHiME-3 challenge task includes speech data recorded in real noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction). The training set has 8,738 noisy utterances (18 hours), the development set has 3,280 noisy utterances (5.6 hours), and the test set has 2,640 noisy utterances (4.5 hours).

The evaluation set of the Aurora4 task consists of 9.4 hours of 4,620 noisy utterances corrupted by one of 14 different noise types, which combine 7 background noise types (street traffic, train station, car, babble, restaurant, airport, and clean) and 2 channel distortions. The noise adaptation features for the Aurora4 task were extracted from the network optimized on the CHiME-3 training set, without any environment information from the Aurora4 task.

We followed the standard way of representing speech using the Kaldi toolkit [22] with its standard recipe: every +5 and -5 consecutive MFCC feature frames are spliced together and projected down to 40 dimensions using LDA, and an fMLLR transform is then computed on top of these features.
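For reference, the +/-5 frame splicing step can be sketched as below (NumPy). The edge-padding behavior is our assumption about how utterance boundaries are handled; the LDA projection and fMLLR estimation are performed by the toolkit [22] and are not reproduced here:

```python
import numpy as np

def splice_frames(features, context=5):
    """Stack each frame with its +/-`context` neighbours (edges padded by
    repeating the first/last frame), giving (T, (2*context+1)*D) vectors
    that are subsequently projected down to 40 dimensions with LDA."""
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + T] for t in range(2 * context + 1)],
                    axis=1).reshape(T, -1)
```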
3.2. System training

To evaluate the proposed techniques, we built six different systems: the baseline, noise-aware training (+N NAT), the offline i-vector framework (+N GMM), the online i-vector framework (+N GMM ON), and our proposed systems, +N DNN and MTL.

For our baseline, we trained the DNN acoustic model without any auxiliary adaptation data. The network contains 7 hidden layers with 2,048 units each. We trained the network using the cross-entropy objective with mini-batch stochastic gradient descent (SGD). We followed the same baseline pipeline provided by the CHiME-3 organizers [16] and matched the WER of the official baseline.

For +N NAT, we estimated the noise in the same way as previous work [9]: we simply averaged the first and last ten frames of each utterance, creating an estimate that was fixed over the utterance.

For the comparators +N GMM and +N GMM ON, we followed the standard offline and online i-vector extraction methods [14, 18]. We built a Universal Background Model (UBM) with 2,048 Gaussians and extracted a 40-dimensional i-vector for the corresponding noise type. For the online i-vector, we used a window of 10 frames of speech.

For our proposed model, +N DNN, we built a DNN with a narrow bottleneck hidden layer, allowing the extraction of more tractable, high-level noise context information. It has five hidden layers; the fourth layer is a bottleneck with 40 units, and the other layers have 1,024 units each. Once the network was optimized, the discriminative noise features of every training and test set were concatenated to each corresponding original feature set. Unlike previous noise estimates [9], our noise features focus on capturing the background information, are optimized by a different objective (classifying the noise types), and are estimated at every input frame without assuming that the noise is stationary.

For the multi-task learning system, MTL, we shared two layers as described in Figure 1b. For a fair comparison, the number of model parameters was matched approximately.

Table 1: Comparison of WERs (%) between the baseline, +N DNN, and MTL models using 50-dimensional embeddings, for 8 different noisy evaluation sets and one clean evaluation set.

    Test set (SNR/RT)    baseline   +N DNN    MTL
    clean                   3.0       2.9      3.1
    music (0 dB)           28.4      25.5     29.1
    music (10 dB)           6.5       6.3      7.4
    reverb (0.6 s)         16.4      15.4     17.4
    reverb (1.0 s)         26.8      25.3     29.0
    street (0 dB)          35.0      32.7     39.1
    street (10 dB)          7.7       6.7      7.7
    white (0 dB)           30.7      28.8     33.8
    white (10 dB)           9.7       8.3      9.5
    Average                18.3      16.9     19.5

Table 2: Comparison of WERs (%) on the CHiME-3 task (in-domain noise, 4.5 hrs) and the Aurora4 task (unseen noise, 9.4 hrs) between the baseline, +N NAT, +N GMM, +N GMM ON, and +N DNN. 40-dimensional noise embeddings were augmented for noise adaptation. The models are trained on the CHiME-3 training dataset (18 hrs). (*) denotes statistical significance (α = 0.05) [23].

    Model        CHiME-3 Dev (%)   CHiME-3 Eval (%)   Aurora4 eval92 (%)
    Baseline           8.9              15.6               11.7
    +N NAT             8.8              15.9*              12.6*
    +N GMM             8.8              15.7               12.4*
    +N GMM ON          8.9              15.7               11.6*
    +N DNN             8.8              15.3*              11.5*

3.3. Results

Before evaluating recognition accuracy, we first visualized the final input features of the different systems. Figure 3 shows the final input features of the in-domain noise set (RM) for the baseline and +N DNN; adding the noise embeddings makes the input feature set significantly more discriminative with respect to the different environments. Figure 2 shows the final input features of the unseen noise set (the Aurora4 evaluation set) for the baseline, +N NAT, +N GMM, and +N DNN. The input features augmented with the +N DNN noise feature are relatively more discriminative with respect to the different environments, which indicates that the model works well even in the unseen noise case.

[Figure 3: Comparison of the final input features of in-domain noise (RM) between the baseline and +N DNN. 100 randomly selected input features are projected into 2-dimensional space by LDA.]

Table 1 compares the recognition accuracy obtained using three models: baseline, MTL, and +N DNN. At all SNRs and for all noise types, +N DNN outperforms the others, even on the clean dataset. We note that the improvements in recognition accuracy are greater at the lower SNRs: for example, we obtained a 2.92% WER improvement on the dataset with background music at 0 dB SNR, but only a 0.19% WER improvement on the clean dataset.

Table 2 compares the WER obtained using the baseline, +N NAT, +N GMM, and +N DNN. Our approach, +N DNN, provided a 2.2% relative reduction in WER compared to the baseline. It can also be seen that the performance of +N NAT depends strongly on the dataset: it does not work on the CHiME-3 task. Unlike the speaker adaptation results, +N GMM performed even worse than the baseline. This result is due to insufficient noise diversity for noise i-vector training, whereas relatively more speaker diversity is available for speaker i-vectors (e.g., 87 speakers are available in the CHiME-3 task).

The right-most column in Table 2 shows the WER obtained on the unseen noise using the baseline, +N NAT, +N GMM, +N GMM ON, and +N DNN. Although the improvement in the unseen noise case (0.9% relative) is smaller than the gain in the in-domain noise case (2.2% relative), it is clear that our noise adaptation approach, +N DNN, is superior to the other noise adaptation techniques. This result is also attributable to insufficient noise diversity, so we expect that further improvement can be achieved by using additional noise types during model training. Moreover, +N NAT (12.6%) and +N GMM (12.4%) are worse than the baseline, which suggests that our proposed system could be the more robust adaptation technique even when the test environments are mostly unknown.

4. Conclusions

We proposed a novel noise adaptation approach, +N DNN, in which we train a deep neural network that dynamically adapts the speech recognition system to the environment in which it is being used. We verified the effectiveness of our proposed framework through improved recognition accuracy in noisy environments. We also compared our approach to the offline and online i-vector frameworks (+N GMM, +N GMM ON), Noise-Aware Training (+N NAT), and MTL. Through a series of experiments on the CHiME-3 and Aurora4 tasks, we showed that our model consistently improves performance on both in-domain and unseen noise tests while using only four different noise types during training.

In future work, we would scale learning across various noisy data types. We believe that further performance improvement, even in unseen noisy environments, can be achieved by using additional and more diverse noises to cover a wider range of noise variation.

5. Acknowledgment

The authors would like to acknowledge the contributions made by Richard M. Stern for his valuable and constructive suggestions during the planning and development of this project.
6. References

[1] P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition," in Proc. ICASSP, vol. 2. IEEE, 1996, pp. 733-736.
[2] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in Proc. ICASSP. IEEE, 2012, pp. 4101-4104.
[3] C. Kim and R. M. Stern, "Nonlinear enhancement of onset for robust speech recognition," in Proc. INTERSPEECH, 2010, pp. 2058-2061.
[4] B. Raj and R. M. Stern, "Missing-feature approaches in speech recognition," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101-116, 2005.
[5] B. Li and K. C. Sim, "Improving robustness of deep neural networks via spectral masking for automatic speech recognition," in Proc. ASRU. IEEE, 2013, pp. 279-284.
[6] A. Narayanan and D. Wang, "Joint noise adaptive training for robust automatic speech recognition," in Proc. ICASSP. IEEE, 2014, pp. 2504-2508.
[7] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75-98, 1998.
[8] M. J. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 272-281, 1999.
[9] M. L. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. ICASSP. IEEE, 2013, pp. 7398-7402.
[10] A.-r. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14-22, 2012.
[11] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[12] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. INTERSPEECH, 2011, pp. 437-440.
[13] Y. Liu et al., "An investigation into speaker informed DNN front-end for LVCSR," in Proc. ICASSP. IEEE, 2015, pp. 4300-4304.
[14] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU. IEEE, 2013, pp. 55-59.
[15] P. Price et al., "Resource Management RM2 2.0," LDC93S3C, DVD. Philadelphia: Linguistic Data Consortium, 1993.
[16] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," submitted to IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.
[17] N. Parihar and J. Picone, "Aurora working group: DSR front end LVCSR evaluation AU/384/02," Inst. for Signal and Information Processing, Mississippi State University, Tech. Rep., vol. 40, p. 94, 2002.
[18] S. Madikeri, I. Himawan, P. Motlicek, and M. Ferras, "Integrating online i-vector extractor with information bottleneck based speaker diarization system," Idiap, Tech. Rep., 2015.
[19] F. Grézl et al., "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. ICASSP, vol. 4. IEEE, 2007, p. IV-757.
[20] M. L. Seltzer and J. Droppo, "Multi-task learning in deep neural networks for improved phoneme recognition," in Proc. ICASSP. IEEE, 2013, pp. 6965-6969.
[21] E. A. Habets, "Room impulse response generator," Technische Universiteit Eindhoven, Tech. Rep., vol. 2, no. 2.4, p. 1, 2006.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011.
[23] L. Gillick and S. J. Cox, "Some statistical issues in the comparison of speech recognition algorithms," in Proc. ICASSP. IEEE, 1989, pp. 532-535.