Submitted to AES Conference on Semantic Audio, Erlangen, Germany, 2017 June 22 – 24

An Experimental Analysis of the Entanglement Problem in
Neural-Network-based Music Transcription Systems

Rainer Kelz* and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University Linz, Austria

February 2, 2017

*Address correspondence to: [email protected]
Abstract

Several recent polyphonic music transcription systems have utilized deep neural networks to achieve state-of-the-art results on various benchmark datasets, pushing the envelope on framewise and note-level performance measures. Unfortunately, we can observe a sort of glass ceiling effect. To investigate this effect, we provide a detailed analysis of the particular kinds of errors that state-of-the-art deep neural transcription systems make, when trained and tested on a piano transcription task. We are ultimately forced to draw a rather disheartening conclusion: the networks seem to learn combinations of notes, and have a hard time generalizing to unseen combinations of notes. Furthermore, we speculate on various means to alleviate this situation.

1 Introduction

The problem of polyphonic transcription can be formally described as the transformation of a time-ordered sequence of (audio) samples X = (x_t)_{t=0}^{T}, x_t ∈ X, into a set of tuples (t_s, t_e, F_0, A) describing start, end, fundamental frequency (or pitch) and, optionally, amplitude of the notes that were played. A slightly easier problem is framewise transcription, or tone-quantized multi-F_0 estimation, where the output is a time-ordered sequence Y = (y_t)_{t=0}^{T}, y_t ∈ Y, with y_t ∈ {0,1}^K being a vector of indicator variables and K denoting the tonal range. In other words, the y vectors specify the note pitches believed to be active in a given audio frame x. Another simplifying assumption is usually the presence of only a single instrument, which more often than not turns out to be the piano, having a tonal range of K = 88.

We will focus on framewise transcription systems only, as they turn out to be a crucial stage in the full transcription process, especially in so-called hybrid systems that post-process the framewise output with dynamic probabilistic models to extract the aforementioned tuples describing musical notes, such as [13, 14].

A diverse set of methods has been employed to tackle the framewise transcription problem, with non-negative matrix factorization (NMF) being one of the more prominent ones. In their seminal paper on NMF for polyphonic transcription [15], Smaragdis and Brown already identified an undesirable property of the technique. NMF seeks to minimize the reconstruction error ‖X − WH‖_N, where X ∈ R_+^{D×T} is the signal to reconstruct, W ∈ R_+^{D×d} is the dictionary, H ∈ R_+^{d×T} are the activations in time of the bases, and ‖·‖_N is a matrix norm. If no additional constraints are applied and no a priori knowledge is exploited, Smaragdis and Brown [15] note that the method learns a dictionary of unique events, rather than individual notes. Two remedies for this problem are also named: either choose sets of notes in such a way that from their intersection single notes can be identified, or present all individual notes in isolation, so that a meaningful dictionary can be learned first.
A similar effect is achievable if the dictionary matrix is harmonically constrained. The latter two methods seem to be popular choices in the literature [15, 1, 2, 3, 4, 6, 11, 16, 17, 8] to solve this problem for NMF.
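To make the factorization concrete, the following minimal NumPy sketch (an illustration under our own choice of norm and parameters, not the exact algorithm of [15]) shows the standard multiplicative updates for the Frobenius norm, together with the fixed-dictionary variant that corresponds to the second remedy: if W is built from isolated-note spectra and kept fixed, only the activations H are estimated.

# Minimal NMF sketch: multiplicative updates for || X - WH ||_F,
# factorizing a non-negative magnitude spectrogram X (D x T) into
# a dictionary W (D x d) and activations H (d x T).
import numpy as np

def nmf(X, d, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    D, T = X.shape
    W = rng.random((D, d)) + eps
    H = rng.random((d, T)) + eps
    for _ in range(n_iter):
        # Lee & Seung multiplicative updates for the Frobenius norm
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nmf_transcribe(X, W_fixed, n_iter=200, eps=1e-10, seed=0):
    # Fixed dictionary of isolated-note spectra: only H is updated,
    # i.e. one of the remedies named in [15].
    rng = np.random.default_rng(seed)
    H = rng.random((W_fixed.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W_fixed.T @ X) / (W_fixed.T @ W_fixed @ H + eps)
    return H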
We conduct a simple experiment to examine whether neural networks trained for a piano transcription task suffer from the same disentanglement problems, followed by an analysis of two very different neural network architectures and the extent to which they exhibit this behavior.
2 Methods

Lacking proper theoretical analytic tools for the model class of neural networks, we resort to empirical tools, namely computational experiments. We train several deep neural networks in a supervised fashion for a framewise piano transcription task and analyze their error behavior. Adhering very closely to already established model architectures, as exemplified in [14, 7], we deviate only in very few aspects, mostly concerning hyperparameter choices that affect training time but have little effect on performance. The parametrized functions we learn are of the form f_net : X → Y, with f_net in turn being composed of multiple simpler functions, commonly referred to as layers in the neural network literature. An example of a network with an input, hidden and output layer would be f_net(x) = f_3(f_2(f_1(x; θ_1); θ_2); θ_3), where f_i(z_{i−1}; θ_i) = σ(W_i z_{i−1} + b_i), with θ_i = {W_i, b_i} having matching dimensions to fit the output z_{i−1} of the previous layer, and σ(·) a nonlinear function applied elementwise. We note here that the functions f_i may actually have more than one input z, and it may also come from layers other than the directly preceding one. We do not explicitly model convolution, as it can be expressed as a matrix-matrix product, given that W and z have the right shapes.
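As a small illustration of this composition of layers (the sizes, the use of a sigmoid nonlinearity throughout, and the random initialization are arbitrary choices for this sketch, not the models defined in the appendix):

# Layer composition sketch: f_i(z; theta_i) = sigma(W_i z + b_i),
# and the network f_net is the composition of such layers.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(z, W, b, sigma=sigmoid):
    return sigma(W @ z + b)

def f_net(x, params):
    # `params` is a list of (W_i, b_i) tuples, applied in order
    z = x
    for W, b in params:
        z = layer(z, W, b)
    return z

rng = np.random.default_rng(0)
sizes = [229, 64, 32, 88]           # input bins, two hidden layers, K = 88 outputs
params = [(rng.normal(size=(n_out, n_in)) * 0.01, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = f_net(rng.random(229), params)  # y[k] ~ probability that pitch k is active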
We choose neural network architectures already established to work well for framewise transcription. Our first choice is exactly the ConvNet architecture as proposed in [7], which achieves state-of-the-art results for framewise transcription on a popular benchmark dataset. We also designed a much smaller version of this network, which will be referred to as SmallConvNet. Additionally, we borrow an architecture originally employed for medical image segmentation, called the UNet [12], and make two small modifications to adapt it for our purposes. We call the adapted architecture AUNet. It is able to directly integrate information at different scales, which is beneficial for smoothing in the temporal direction and for identifying groups of partials and their distance in the frequency dimension. The precise definitions for all networks, as well as schematic drawings of the architectures for the ConvNet, the SmallConvNet and the AUNet, can be found in the appendix. The definitions are listed in tables 1, 2 and 3, whereas the schemata are depicted in figures 7a, 7b and 8, respectively.

3 Datasets

We use a synthetic dataset to conduct small scale experiments with the SmallConvNet. A subset of the MAPS dataset [5] is used to train and test the ConvNet and the AUNet models. The MAPS dataset consists of several classical piano pieces, along with isolated notes and common chords, rendered with 7 different software synthesizers (samplers) in addition to 2 Disklavier piano recordings, one with the microphone close to the piano, and one with the microphone farther away and thus containing room acoustics. We now describe each subset in turn:

3.1 FLUID

For focused computational experiments we synthesize two-note combinations and isolated notes. We only use notes within an 11 semitone range around a reference pitch (C4/MIDI 60), creating \binom{23}{2} = 253 two-note intervals. The onset and offset of the two notes are exactly synchronous. We use the free software sampler Fluidsynth (www.fluidsynth.org) together with the freely available Fluid-R3-GM soundfont (http://www.musescore.org/download/fluid-soundfont.tar.gz) to render a dataset FLUID-COMBI, where the train and validation sets both consist of the aforementioned intervals, whereas the test set contains individual notes only.
For FLUID-ISOL, the individual notes are in the train and validation sets, whereas the test set contains the intervals. So for both datasets the intersection of unique events in train and test sets is the empty set. The error behavior of the SmallConvNet on this dataset is discussed in section 4.1.
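The two training/testing conditions can be summarized by a small sketch; the MIDI range 49–71 follows from the 11 semitone range around C4/MIDI 60 stated above, while the dictionary layout of the splits is our own shorthand (the actual audio rendering with Fluidsynth is omitted):

# Enumerating the FLUID note sets
from itertools import combinations

PITCHES = list(range(49, 72))                 # 23 pitches: MIDI 49..71
INTERVALS = list(combinations(PITCHES, 2))    # 253 synchronous two-note sets
ISOLATED = [(p,) for p in PITCHES]            # 23 isolated notes

assert len(INTERVALS) == 253 and len(ISOLATED) == 23

# FLUID-COMBI: train/validation on the intervals, test on isolated notes.
# FLUID-ISOL: train/validation on isolated notes, test on the intervals.
fluid_combi = {"train": INTERVALS, "test": ISOLATED}
fluid_isol = {"train": ISOLATED, "test": INTERVALS}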
3.2 MAPS-MUS

This subset consists only of the rendered classical piano pieces in the MAPS dataset. We adopt the more realistic train-test scenario described in [14], which is referred to as Configuration-II. It is more realistic because it trains only on synthetic renderings and tests on the real piano recordings. We select as training set all pieces from 6 synthesizers, the validation set comprises all renderings from a randomly selected 7th, and the test set is made up of all Disklavier recordings. We will refer to this dataset as MAPS-MUS from now on. The respective error behaviors of the two larger models, the ConvNet and the AUNet, on this dataset are discussed in section 4.2. We did not use the test set for conducting any error analysis, other than measuring final performance after model selection, to make sure that both models actually achieve state-of-the-art results. The rationale behind this is explained in detail in section 4.2.
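A sketch of this Configuration-II style split is given below. The instrument identifiers are the usual MAPS folder names (an assumption to be checked against the dataset at hand, not taken from the text), and the choice of which synthesizer becomes the validation set is randomized, mirroring the "randomly selected 7th":

# Configuration-II style split of the MAPS MUS pieces
import random

SYNTH = ["AkPnBcht", "AkPnBsdf", "AkPnCGdD", "AkPnStgb",
         "SptkBGAm", "SptkBGCl", "StbgTGd2"]          # 7 software synthesizers
DISKLAVIER = ["ENSTDkCl", "ENSTDkAm"]                 # close / ambient recordings

random.seed(0)
valid_synth = random.choice(SYNTH)
split = {
    "train": [s for s in SYNTH if s != valid_synth],  # pieces from 6 synthesizers
    "valid": [valid_synth],                           # all renderings of the 7th
    "test": DISKLAVIER,                               # the real piano recordings
}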
4 Results

4.1 SmallConvNet and FLUID

We start with a controlled empirical analysis of the disentanglement problem using our synthetic datasets. We train the SmallConvNet for framewise transcription on logarithmically filtered, log-magnitude spectrograms with 229 bins, as proposed in [7]. The output size of the network is limited to 23 notes, and it has only 5327 parameters, to make it approximately comparable to NMF with a dictionary matrix W ∈ R^{229×23} having 5267 parameters. We note that overfitting, i.e. fitting noise in the data, is not the real problem here, as the acoustic properties of the sound sources are the same for train and test set.
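For reference, a rough sketch of such an input representation is given below. All parameters (sample rate, FFT size, hop size, frequency range) are assumptions for illustration, and FFT bins are simply pooled into logarithmically spaced bands rather than weighted with a proper triangular filterbank as in [7]:

# Crude logarithmically filtered, log-magnitude spectrogram
import numpy as np
from scipy.signal import stft

def log_filtered_spectrogram(audio, sr=44100, n_fft=4096, hop=441,
                             n_bands=229, fmin=30.0, fmax=8000.0):
    freqs, _, Z = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(Z)                               # (n_fft // 2 + 1, n_frames)
    edges = np.geomspace(fmin, fmax, n_bands + 1) # log-spaced band edges
    out = np.zeros((n_bands, mag.shape[1]))
    for b in range(n_bands):
        sel = (freqs >= edges[b]) & (freqs < edges[b + 1])
        if sel.any():                             # very narrow low bands may stay empty here
            out[b] = mag[sel].sum(axis=0)
    return np.log1p(out)                          # log-magnitude compression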
The general idea of this experiment is to discover to what extent the network is capable of detecting isolated notes if all it has ever seen were combinations, and vice versa. We can find a partial answer to this question in figures 1 and 2. The figures all show the proportion of frames in which all notes have been exactly identified, and contrast it with the proportions of frames in which notes have been added or omitted. This means that the three quantities do not necessarily sum to one, because in a frame some notes could have been added and some others omitted.
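The three quantities can be computed from binary piano-roll matrices as in the following sketch (the function name and conventions are ours); as noted above, "added" and "omitted" can both hold in the same frame, which is why the fractions need not sum to one:

# Per-frame exact / added / omitted fractions
import numpy as np

def frame_error_fractions(pred, target):
    # pred, target: boolean arrays of shape (n_frames, K)
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    exact = np.all(pred == target, axis=1)    # every pitch correct in the frame
    added = np.any(pred & ~target, axis=1)    # at least one spurious pitch
    omitted = np.any(~pred & target, axis=1)  # at least one missing pitch
    return exact.mean(), added.mean(), omitted.mean()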
In figure 1 we can observe that after seeing only two-note intervals, the network is able to generalize to isolated notes to some extent. While a surprising number of individual notes are transcribed perfectly, some notes are still not recognized properly. For these notes, their companion notes from the train set are predicted as sounding simultaneously, indicating a failure to disentangle note combinations during training.

[Figure 1 about here: bar plot titled "23 isolated notes", showing per-note fractions of exact, added and omitted frames for the pitches C#3 to B4.]

Figure 1: For isolated notes present only in the test set, this is the proportion of exactly transcribed frames, along with the proportions of frames that had notes added or omitted, respectively. Transcriptions stem from the SmallConvNet trained on FLUID-COMBI.
[Figure 2 about here: bar plot titled "23 intervals", showing fractions of exact, added and omitted frames for the 23 best transcribed two-note intervals.]

Figure 2: For a selection of the 23 best transcribed intervals present only in the test set, this is the proportion of exactly transcribed frames, along with the proportions of frames that had notes added or omitted, respectively. Transcriptions stem from the SmallConvNet trained on FLUID-ISOL.

In figure 2 we see that the network utterly fails to generalize from isolated notes to note combinations, with only two exceptions. We plotted only the 23 best transcribed note combinations, as for the remaining 230 intervals the proportion of omission errors is very close to, or even exactly, 1.0. The network does manage to transcribe two of the intervals with acceptable accuracy; however, an explanation of why exactly these two intervals could be recognized eludes us at the moment.

We might draw a preliminary conclusion from these results: the strategy most successful for alleviating the disentanglement problem for NMF, namely learning the dictionary W from isolated notes, does not work for neural transcription systems. The NMF of spectrograms is a linear system, and therefore has the superposition property: its response to multiple inputs is the sum of the responses to the individual inputs. This is not necessarily true for neural networks, as they may learn to approximate a linear function, but do not have to.

The other strategy mentioned in [15], namely showing combinations of notes to the networks, seems to work fairly well for the majority of isolated notes, as can be observed in figure 1. Unfortunately, the number of combinations for the tonal range of the piano grows large very quickly. Even when assuming a maximum polyphony of only 6, we would already need to show \sum_{i=2}^{6} \binom{88}{i} = 583,552,442 combinations to the network (even with polyphony capped at 5, the count is still 41,621,206).
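The count is easily verified (and contrasted with the polyphony-5 case) with a few lines:

# Sanity check of the combinatorial counts above
from math import comb

print(sum(comb(88, i) for i in range(2, 6)))   # polyphony 2..5 -> 41621206
print(sum(comb(88, i) for i in range(2, 7)))   # polyphony 2..6 -> 583552442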
4.2 ConvNet, AUNet and MAPS-MUS

We now turn our attention to a more musically relevant dataset. We train several instances of both the ConvNet and the AUNet, closely adhering to the training procedure described in [7], and select for analysis the model that achieves the highest framewise f-measure on the validation set. Our analysis of error behavior is restricted to the validation set as well, simply because we want to avoid learning too much about the composition of the MAPS test set. The scenario is the same, as the validation set consists of pieces rendered by an unseen synthesizer. We feel that this also lends some additional strength to our argument, as we conduct our analysis on the best performing model for this set.

Two different scenarios are considered. The first scenario looks at the transcription results for notes and note combinations that are present in both the train and validation set, referred to as "shared" combinations. A low proportion of additions will tell us that there were a sufficient number of examples for this particular combination, so it could not be overshadowed by combinations containing additional notes. A high proportion of omissions will indicate issues with generalization to different acoustic properties. If both proportions are high, this indicates that one or more notes in the combination have been mistaken for others.

The second scenario examines the transcription results for notes and note combinations that are present only in the validation set, referred to as "unshared". If the proportion of exactly transcribed frames is high, the network must have learned to disentangle individual notes from the different combinations shown to it, and be able to recognize these disentangled parts in new, unseen combinations. A high proportion of additions will mainly tell us that the network has failed to disentangle parts, but still tries to combine the ones it knows about. A high proportion of omissions points to either a failure to simultaneously disentangle and recombine, a failure to generalize to different acoustic properties, or, more probably, both.
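Operationally, the "shared"/"unshared" partition can be obtained from the per-frame sets of active pitches of the two splits, as in the following sketch (the representation and names are our own choices):

# Partitioning note combinations into shared and unshared sets
def active_sets(piano_roll):
    # piano_roll: iterable of per-frame collections of active MIDI pitches
    return {frozenset(frame) for frame in piano_roll if len(frame) > 0}

def shared_unshared(train_rolls, valid_rolls):
    train_sets = set().union(*(active_sets(r) for r in train_rolls))
    valid_sets = set().union(*(active_sets(r) for r in valid_rolls))
    shared = valid_sets & train_sets      # combinations seen during training
    unshared = valid_sets - train_sets    # only ever seen at validation time
    return shared, unshared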
[Figure 3 about here: bar plot titled "20 most common, shared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 3: For the most common note combinations present both in the train set and validation set, this is the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the ConvNet trained on MAPS-MUS.

[Figure 4 about here: bar plot titled "20 most common, unshared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 4: The most common note combinations present only in the validation set, and the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the ConvNet trained on MAPS-MUS.

In figure 3 we can see two things: the most common note combinations present in both train and validation set are actually isolated notes, and the relative frequency of exactly transcribed notes is comparatively high. Unfortunately, we can also see that the proportion of frames in which additional notes were erroneously transcribed is much higher than we would prefer, pointing both to a lack of examples for these individual notes at train time and to the failure to generalize from combinations. They all are confused with combinations every so often. The low proportion of omission errors for isolated notes indicates only mild difficulties in generalizing to different acoustical properties.

Looking at figure 4, we can see the error behavior of the network for the most common note combinations that are only present in the validation set. We notice a large number of omission errors, which also indicates a failure to generalize to unseen note combinations. A few combinations, such as (G3, A3, C4, D4), stand out though as being transcribed with great accuracy. We could find no satisfactory explanation for this so far, other than the suspicion that it has to do with their low polyphony.

If we compare the results of the ConvNet (figure 3) and the AUNet (figure 5) for the most common note combinations which are shared by the train and validation set, we can observe that the AUNet achieves marginally better exact transcription results across the board. In some cases, the proportion of added notes is reduced; however, this happens at the expense of a slightly increased proportion of omitted note combinations. Likewise, the results for the ConvNet (figure 4) and the AUNet (figure 6) for the "unshared" case appear to be very similar, indicating a comparable error behavior across very different architectures.

Concluding this section, we would like to emphasize that both architectures match (or even slightly exceed) the framewise transcription results on the MAPS dataset reported in [7], which currently defines the state of the art. In other words, it is unlikely that the problematic results reported above are due to poor hyperparameter choices.
[Figure 5 about here: bar plot titled "20 most common, shared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 5: The most common note combinations present both in the train set and validation set, and the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the AUNet trained on MAPS-MUS.

[Figure 6 about here: bar plot titled "20 most common, unshared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 6: The most common note combinations present only in the validation set, and the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the AUNet trained on MAPS-MUS.

5 Summary

We have experimentally shown that certain neural network architectures have difficulties disentangling inputs which are superpositions or mixtures of individual parts, as discussed in section 4.2. They learn to do so only if they are shown a large number of combinations whose constituent parts overlap, and they utterly fail to generalize to combinations when trained on individual parts of the mixture alone, as we determined in a small experiment described in section 4.1. Any approach that tries to learn from a fixed set of combinations, for example defined by a set of music pieces, without incorporating additional constraints or prior knowledge, as is done in [13, 14, 7], will suffer from this problem.

The brute-force approach to solving the disentanglement problem would be to show all possible combinations to the network. Unfortunately this solution is intractable, due to the large tonal range and maximum polyphony of certain instruments. Arguably this approach would also not necessarily force the networks to learn how to disentangle, as they could, in principle, simply memorize all combinations. Learning a different note detector for each note, as done in [9, 10], suffers from the same problem if the combinations shown to each detector are not diverse enough. Depending on the expressiveness of the model class, "diverse enough" could easily mean "all combinations".

A partial solution to this problem might involve a modification of the loss function for the network. An additional objective must specify the need to disentangle individual notes explicitly. The network needs to learn to decompose a (nonlinear) mixture of signals into its constituent parts, a task commonly known as "source separation". Finding ways of formulating such an additional objective for networks trained on multi-label classification tasks such as framewise transcription is the subject of our ongoing research.
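Purely as an illustration of what such an additional objective could look like (nothing in this sketch is taken from the paper or from existing work; it merely combines the usual framewise binary cross-entropy with a hypothetical source-separation-style reconstruction term):

# Hypothetical multi-task loss: transcription + per-note reconstruction
import torch
import torch.nn.functional as F

def combined_loss(note_logits, note_targets, per_note_spectra, input_spec,
                  weight=0.1):
    # note_logits, note_targets: (batch, K); note_targets is a float 0/1 tensor
    # per_note_spectra: (batch, K, D) hypothetical per-note spectral estimates
    # input_spec: (batch, D) observed magnitude spectrogram frames
    bce = F.binary_cross_entropy_with_logits(note_logits, note_targets)
    # auxiliary objective: spectra of active notes should add up to the input
    recon = (per_note_spectra * note_targets.unsqueeze(-1)).sum(dim=1)
    aux = F.mse_loss(recon, input_spec)
    return bce + weight * aux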
Acknowledgements

This work is supported by the European Research Council (ERC Grant Agreement 670035, project CON ESPRESSIONE). The Tesla K40 used for this research was donated by the NVIDIA Corporation.

References

[1] Emmanouil Benetos, Sebastian Ewert, and Tillman Weyde. Automatic transcription of pitched and unpitched sounds from polyphonic music. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 3107–3111, 2014.
[2] Nancy Bertin, Roland Badeau, and Gaël Richard. Blind signal decompositions for automatic transcription of polyphonic music: NMF and K-SVD on the benchmark. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, Honolulu, Hawaii, USA, April 15-20, 2007, pages 65–68, 2007.

[3] Nancy Bertin, Roland Badeau, and Emmanuel Vincent. Fast Bayesian NMF algorithms enforcing harmonicity and temporal continuity in polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA '09, New Paltz, NY, USA, October 18-21, 2009, pages 29–32, 2009.

[4] Arnaud Dessein, Arshia Cont, and Guillaume Lemaitre. Real-time polyphonic music transcription with non-negative matrix factorization and beta-divergence. In Proceedings of the 11th International Society for Music Information Retrieval Conference, ISMIR 2010, Utrecht, Netherlands, August 9-13, 2010, pages 489–494, 2010.

[5] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio, Speech & Language Processing, 18(6):1643–1654, 2010.

[6] Graham Grindlay and Daniel P. W. Ellis. Multi-voice polyphonic music transcription using eigeninstruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA '09, New Paltz, NY, USA, October 18-21, 2009, pages 53–56, 2009.

[7] Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer. On the potential of simple framewise approaches to piano transcription. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City, United States, August 7-11, 2016, pages 475–481, 2016.

[8] Anis Khlif and Vidhyasaharan Sethu. An iterative multi range non-negative matrix factorization algorithm for polyphonic music transcription. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Málaga, Spain, October 26-30, 2015, pages 330–335, 2015.

[9] Matija Marolt. A connectionist approach to automatic transcription of polyphonic piano music. IEEE Trans. Multimedia, 6(3):439–449, 2004.

[10] Juhan Nam, Jiquan Ngiam, Honglak Lee, and Malcolm Slaney. A classification-based polyphonic piano transcription approach using learned feature representations. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011, pages 175–180, 2011.

[11] Ken O'Hanlon and Mark D. Plumbley. Polyphonic piano transcription using non-negative matrix factorisation with group sparsity. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 3112–3116, 2014.

[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241, 2015.

[13] Siddharth Sigtia, Emmanouil Benetos, Nicolas Boulanger-Lewandowski, Tillman Weyde, Artur S. d'Avila Garcez, and Simon Dixon. A hybrid recurrent neural network for music transcription. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 2061–2065, 2015.
[14] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Trans. Audio, Speech & Language Processing, 24(5):927–939, 2016.

[15] Paris Smaragdis and Judith C. Brown. Non-negative matrix factorization for polyphonic music transcription. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No. 03TH8684), pages 177–180. IEEE, 2003.

[16] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech & Language Processing, 18(3):528–537, 2010.

[17] Felix Weninger, Christian Kirst, Björn W. Schuller, and Hans-Joachim Bungartz. A discriminative approach to polyphonic piano note transcription using supervised non-negative matrix factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6–10, 2013.

Appendix

[Figure 7 about here: schematic drawings of (a) the ConvNet and (b) the SmallConvNet, with per-layer feature map sizes.]

Figure 7: The ConvNet and SmallConvNet architectures. White boxes denote convolutional layers, black thick lines stand for fully connected layers. Arrows show information flow, dashed lines indicate a max-pooling operation.

Layer Type          Output Dimensions    No. of Params
Input               1x5x229
Conv(Id)            32x5x229@3x3         288
BatchNorm           32x5x229             128
Relu                32x5x229
Conv(Id)            32x3x227@3x3         9216
BatchNorm           32x3x227             128
Relu                32x3x227
MaxPool             32x3x113@1x2
Dropout, p=0.25     32x3x113
Conv(Id)            64x1x111@3x3         18432
BatchNorm           64x1x111             256
Relu                64x1x111
MaxPool             64x1x55@1x2
Dropout, p=0.25     64x1x55
Dense(Id)           512                  1802240
BatchNorm           512                  2048
Relu                512
Dropout, p=0.5      512
Dense(Sigmoid)      88                   45144
Total                                    1877880

Table 1: The ConvNet Architecture
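For convenience, a PyTorch sketch of the layer stack in table 1 is given below. Bias-free convolutions and 'valid' padding for the second and third convolution reproduce the listed output shapes and parameter counts; these details are inferred from the table rather than taken from a published implementation.

# ConvNet stack inferred from table 1
import torch.nn as nn

conv_net = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1, bias=False),   # 1x5x229 -> 32x5x229, 288 params
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, bias=False),             # -> 32x3x227, 9216 params
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d((1, 2)),                         # -> 32x3x113
    nn.Dropout(0.25),
    nn.Conv2d(32, 64, 3, bias=False),             # -> 64x1x111, 18432 params
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d((1, 2)),                         # -> 64x1x55
    nn.Dropout(0.25),
    nn.Flatten(),
    nn.Linear(64 * 1 * 55, 512, bias=False),      # 1802240 params
    nn.BatchNorm1d(512), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 88),                           # 45144 params (with bias)
    nn.Sigmoid(),
)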
[Figure 8 about here: schematic drawing of the AUNet, with feature map sizes ranging from 256x256x32 down to 32x32x128 and back up to 256x88x1.]

Figure 8: A schematic drawing of the AUNet architecture. White boxes denote convolutional layers, their width corresponds to the number of convolutional kernels, their height corresponds to the size of the resulting feature maps. Arrows show information flow, dashed lines indicate either a max-pooling operation, if the line goes from a higher to a lower box, or an upscaling operation if the line goes from a lower to a higher box. Grey boxes next to white boxes denote a concatenation of feature maps. We made two adaptations to the original UNet architecture. The first is the use of upscaling operations instead of deconvolutions, and the second adaptation is the last layer having convolutions with a large kernel width in the frequency direction.
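In PyTorch terms, the two adaptations could be written as follows (a sketch inferred from the caption and table 3, not an official implementation):

# The two AUNet adaptations: plain upscaling on the expanding path, and a
# final sigmoid convolution that is wide along the frequency axis.
import torch.nn as nn

upscale = nn.Upsample(scale_factor=2, mode='nearest')  # instead of a deconvolution

final_layer = nn.Sequential(
    nn.Conv2d(32, 1, kernel_size=(1, 41)),              # 32x256x128 -> 1x256x88, 1313 params
    nn.Sigmoid(),
)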
Layer Type          Output Dimensions    No. of Params
Input               1x5x229
Conv(Id)            8x5x229@3x3          72
BatchNorm           8x5x229              32
Relu                8x5x229
Conv(Id)            8x3x227@3x3          576
BatchNorm           8x3x227              32
Relu                8x3x227
MaxPool             8x3x113@1x2
Dropout, p=0.25     8x3x113
Conv(Id)            8x1x111@3x3          576
BatchNorm           8x1x111              32
Relu                8x1x111
MaxPool             8x1x55@1x2
Dropout, p=0.25     8x1x55
Conv(Id)            8x1x53@1x3           192
BatchNorm           8x1x53               32
Relu                8x1x53
MaxPool             8x1x26@1x2
Dropout, p=0.25     8x1x26
Dense(Id)           16                   3328
BatchNorm           16                   64
Relu                16
Dropout, p=0.5      16
Dense(Sigmoid)      23                   391
Total                                    5327

Table 2: The SmallConvNet Architecture

Layer Type          Output Dimensions    No. of Params
Input               1x256x256
Conv(Id)            32x256x256@3x3       288
BatchNorm           32x256x256           128
Elu                 32x256x256
Conv(Id)            32x256x256@3x3       9216
BatchNorm           32x256x256           128
Elu                 32x256x256
MaxPool             32x128x128@2x2
Conv(Id)            32x128x128@3x3       9216
BatchNorm           32x128x128           128
Elu                 32x128x128
Conv(Id)            32x128x128@3x3       9216
BatchNorm           32x128x128           128
Elu                 32x128x128
MaxPool             32x64x64@2x2
Conv(Id)            64x64x64@3x3         18432
BatchNorm           64x64x64             256
Elu                 64x64x64
Conv(Id)            64x64x64@3x3         36864
BatchNorm           64x64x64             256
Elu                 64x64x64
MaxPool             64x32x32@2x2
Conv(Id)            64x32x32@3x3         36864
BatchNorm           64x32x32             256
Elu                 64x32x32
Conv(Id)            64x32x32@3x3         36864
BatchNorm           64x32x32             256
Elu                 64x32x32
MaxPool             64x16x16@2x2
Conv(Id)            128x16x16@3x3        73728
BatchNorm           128x16x16            512
Elu                 128x16x16
Conv(Id)            128x16x16@3x3        147456
BatchNorm           128x16x16            512
Elu                 128x16x16
Upscale             128x32x32
Concat              192x32x32
Conv(Id)            128x32x32@3x3        221184
BatchNorm           128x32x32            512
Elu                 128x32x32
Conv(Id)            128x32x32@3x3        147456
BatchNorm           128x32x32            512
Elu                 128x32x32
Upscale             128x64x64
Concat              192x64x64
Conv(Id)            64x64x64@3x3         110592
BatchNorm           64x64x64             256
Elu                 64x64x64
Conv(Id)            64x64x64@3x3         36864
BatchNorm           64x64x64             256
Elu                 64x64x64
Upscale             64x128x128
Concat              96x128x128
Conv(Id)            32x128x128@3x3       27648
BatchNorm           32x128x128           128
Elu                 32x128x128
Conv(Id)            32x128x128@3x3       9216
BatchNorm           32x128x128           128
Elu                 32x128x128
Upscale             32x256x256
Concat              64x256x256
Conv(Id)            32x256x256@3x3       18432
BatchNorm           32x256x256           128
Elu                 32x256x256
Conv(Id)            32x256x128@3x3       9216
BatchNorm           32x256x128           128
Elu                 32x256x128
Conv(Sigmoid)       1x256x88@1x41        1313
Total                                    964673

Table 3: The AUNet Architecture