Submitted to AES Conference on Semantic Audio, Erlangen, Germany, 2017 June 22 – 24

An Experimental Analysis of the Entanglement Problem in
Neural-Network-based Music Transcription Systems

Rainer Kelz* and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University Linz, Austria

February 2, 2017

*Address correspondence to: [email protected]
Abstract

Several recent polyphonic music transcription systems have utilized deep neural networks to achieve state-of-the-art results on various benchmark datasets, pushing the envelope on framewise and note-level performance measures. Unfortunately, we can observe a sort of glass ceiling effect. To investigate this effect, we provide a detailed analysis of the particular kinds of errors that state-of-the-art deep neural transcription systems make, when trained and tested on a piano transcription task. We are ultimately forced to draw a rather disheartening conclusion: the networks seem to learn combinations of notes, and have a hard time generalizing to unseen combinations of notes. Furthermore, we speculate on various means to alleviate this situation.

1 Introduction

The problem of polyphonic transcription can be formally described as the transformation of a time-ordered sequence of (audio) samples X = (x_t)_{t=0}^{T}, x_t ∈ X, into a set of tuples (t_s, t_e, F_0, A) describing start, end, fundamental frequency (or pitch) and, optionally, amplitude of the notes that were played. A slightly easier problem is framewise transcription, or tone-quantized multi-F_0 estimation, where the output is a time-ordered sequence Y = (y_t)_{t=0}^{T}, y_t ∈ Y, with y_t ∈ {0,1}^K being a vector of indicator variables and K denoting the tonal range. In other words, the y vectors specify the note pitches believed to be active in a given audio frame x. Another simplifying assumption is usually the presence of only a single instrument, which more often than not turns out to be the piano, having a tonal range of K = 88.

We will focus on framewise transcription systems only, as they turn out to be a crucial stage in the full transcription process, especially in so-called hybrid systems that post-process the framewise output with dynamic probabilistic models to extract the aforementioned tuples describing musical notes, such as [13, 14].

A diverse set of methods has been employed to tackle the framewise transcription problem, with non-negative matrix factorization (NMF) being one of the more prominent ones. In their seminal paper on NMF for polyphonic transcription [15], Smaragdis and Brown already identified an undesirable property of the technique. NMF seeks to minimize the reconstruction error ‖X − WH‖_N, where X ∈ R_+^{D×T} is the signal to reconstruct, W ∈ R_+^{D×d} is the dictionary, H ∈ R_+^{d×T} are the activations in time of the bases, and ‖·‖_N is a matrix norm. If no additional constraints are applied and no a priori knowledge is exploited, Smaragdis and Brown [15] note that the method learns a dictionary of unique events, rather than individual notes. Two remedies for this problem are also named: either choose sets of notes in such a way that from their intersection single notes can be identified, or present all individual notes in isolation, so that a meaningful dictionary can be learned first.
A similar effect is achievable if the dictionary matrix is harmonically constrained. The latter two methods seem to be popular choices in the literature [15, 1, 2, 3, 4, 6, 11, 16, 17, 8] to solve this problem for NMF.
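To make the factorization concrete, the following minimal NumPy sketch (an illustration under our own choice of norm and parameters, not the exact algorithm of [15]) shows the standard multiplicative updates for the Frobenius norm, together with the fixed-dictionary variant that corresponds to the second remedy: if W is built from isolated-note spectra and kept fixed, only the activations H are estimated.

# Minimal NMF sketch: multiplicative updates for || X - WH ||_F,
# factorizing a non-negative magnitude spectrogram X (D x T) into
# a dictionary W (D x d) and activations H (d x T).
import numpy as np

def nmf(X, d, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    D, T = X.shape
    W = rng.random((D, d)) + eps
    H = rng.random((d, T)) + eps
    for _ in range(n_iter):
        # Lee & Seung multiplicative updates for the Frobenius norm
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nmf_transcribe(X, W_fixed, n_iter=200, eps=1e-10, seed=0):
    # Fixed dictionary of isolated-note spectra: only H is updated,
    # i.e. one of the remedies named in [15].
    rng = np.random.default_rng(seed)
    H = rng.random((W_fixed.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W_fixed.T @ X) / (W_fixed.T @ W_fixed @ H + eps)
    return H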
We conduct a simple experiment to examine whether neural networks trained for a piano transcription task suffer from the same disentanglement problems, followed by an analysis of two very different neural network architectures and the extent to which they exhibit this behavior.
2 Methods

Lacking proper theoretical analytic tools for the model class of neural networks, we resort to empirical tools, namely computational experiments. We train several deep neural networks in a supervised fashion for a framewise piano transcription task and analyze their error behavior. Adhering very closely to already established model architectures, as exemplified in [14, 7], we deviate only in very few aspects, mostly concerning hyperparameter choices that affect training time but have little effect on performance. The parametrized functions we learn are of the form f_net : X → Y, with f_net in turn being composed of multiple simpler functions, commonly referred to as layers in the neural network literature. An example of a network with an input, hidden and output layer would be f_net(x) = f_3(f_2(f_1(x; θ_1); θ_2); θ_3), where f_i(z_{i−1}; θ_i) = σ(W_i z_{i−1} + b_i), with θ_i = {W_i, b_i} having matching dimensions to fit the output z_{i−1} of the previous layer, and σ(·) a nonlinear function applied elementwise. We note here that the functions f_i may actually have more than one input z, and it may also come from layers other than the directly preceding one. We do not explicitly model convolution, as it can be expressed as a matrix-matrix product, given that W and z have the right shapes.
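As a small illustration of this composition of layers (the sizes, the use of a sigmoid nonlinearity throughout, and the random initialization are arbitrary choices for this sketch, not the models defined in the appendix):

# Layer composition sketch: f_i(z; theta_i) = sigma(W_i z + b_i),
# and the network f_net is the composition of such layers.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(z, W, b, sigma=sigmoid):
    return sigma(W @ z + b)

def f_net(x, params):
    # `params` is a list of (W_i, b_i) tuples, applied in order
    z = x
    for W, b in params:
        z = layer(z, W, b)
    return z

rng = np.random.default_rng(0)
sizes = [229, 64, 32, 88]           # input bins, two hidden layers, K = 88 outputs
params = [(rng.normal(size=(n_out, n_in)) * 0.01, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = f_net(rng.random(229), params)  # y[k] ~ probability that pitch k is active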
We choose neural network architectures already established to work well for framewise transcription. Our first choice is exactly the ConvNet architecture as proposed in [7], which achieves state-of-the-art results for framewise transcription on a popular benchmark dataset. We also designed a much smaller version of this network, which will be referred to as SmallConvNet. Additionally, we borrow an architecture originally employed for medical image segmentation, called the UNet [12], and make two small modifications to adapt it for our purposes. We call the adapted architecture AUNet. It is able to directly integrate information at different scales, which is beneficial for smoothing in the temporal direction and for identifying groups of partials and their distance in the frequency dimension. The precise definitions for all networks, as well as schematic drawings of the architectures for the ConvNet, the SmallConvNet and the AUNet, can be found in the appendix. The definitions are listed in tables 1, 2 and 3, whereas the schemata are depicted in figures 7a, 7b and 8, respectively.

3 Datasets

We use a synthetic dataset to conduct small scale experiments with the SmallConvNet. A subset of the MAPS dataset [5] is used to train and test the ConvNet and the AUNet models. The MAPS dataset consists of several classical piano pieces, along with isolated notes and common chords, rendered with 7 different software synthesizers (samplers) in addition to 2 Disklavier piano recordings, one with the microphone close to the piano, and one with the microphone farther away and thus containing room acoustics. We now describe each subset in turn:

3.1 FLUID

For focused computational experiments we synthesize two-note combinations and isolated notes. We only use notes within an 11 semitone range around a reference pitch (C4/MIDI 60), creating \binom{23}{2} = 253 two-note intervals. The onset and offset of the two notes are exactly synchronous. We use the free software sampler Fluidsynth (www.fluidsynth.org) together with the freely available Fluid-R3-GM soundfont (http://www.musescore.org/download/fluid-soundfont.tar.gz) to render a dataset FLUID-COMBI, where the train and validation sets both consist of the aforementioned intervals, whereas the test set contains individual notes only.
For FLUID-ISOL, the individual notes are in the train and validation sets, whereas the test set contains the intervals. So for both datasets the intersection of unique events in train and test sets is the empty set. The error behavior of the SmallConvNet on this dataset is discussed in section 4.1.
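The two training/testing conditions can be summarized by a small sketch; the MIDI range 49–71 follows from the 11 semitone range around C4/MIDI 60 stated above, while the dictionary layout of the splits is our own shorthand (the actual audio rendering with Fluidsynth is omitted):

# Enumerating the FLUID note sets
from itertools import combinations

PITCHES = list(range(49, 72))                 # 23 pitches: MIDI 49..71
INTERVALS = list(combinations(PITCHES, 2))    # 253 synchronous two-note sets
ISOLATED = [(p,) for p in PITCHES]            # 23 isolated notes

assert len(INTERVALS) == 253 and len(ISOLATED) == 23

# FLUID-COMBI: train/validation on the intervals, test on isolated notes.
# FLUID-ISOL: train/validation on isolated notes, test on the intervals.
fluid_combi = {"train": INTERVALS, "test": ISOLATED}
fluid_isol = {"train": ISOLATED, "test": INTERVALS}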
3.2 MAPS-MUS

This subset consists only of the rendered classical piano pieces in the MAPS dataset. We adopt the more realistic train-test scenario described in [14], which is referred to as Configuration-II. It is more realistic because it trains only on synthetic renderings and tests on the real piano recordings. We select as training set all pieces from 6 synthesizers, the validation set comprises all renderings from a randomly selected 7th, and the test set is made up of all Disklavier recordings. We will refer to this dataset as MAPS-MUS from now on. The respective error behaviors of the two larger models, the ConvNet and the AUNet, on this dataset are discussed in section 4.2. We did not use the test set for conducting any error analysis, other than measuring final performance after model selection, to make sure that both models actually achieve state-of-the-art results. The rationale behind this is explained in detail in section 4.2.
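A sketch of this Configuration-II style split is given below. The instrument identifiers are the usual MAPS folder names (an assumption to be checked against the dataset at hand, not taken from the text), and the choice of which synthesizer becomes the validation set is randomized, mirroring the "randomly selected 7th":

# Configuration-II style split of the MAPS MUS pieces
import random

SYNTH = ["AkPnBcht", "AkPnBsdf", "AkPnCGdD", "AkPnStgb",
         "SptkBGAm", "SptkBGCl", "StbgTGd2"]          # 7 software synthesizers
DISKLAVIER = ["ENSTDkCl", "ENSTDkAm"]                 # close / ambient recordings

random.seed(0)
valid_synth = random.choice(SYNTH)
split = {
    "train": [s for s in SYNTH if s != valid_synth],  # pieces from 6 synthesizers
    "valid": [valid_synth],                           # all renderings of the 7th
    "test": DISKLAVIER,                               # the real piano recordings
}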
4 Results

4.1 SmallConvNet and FLUID

We start with a controlled empirical analysis of the disentanglement problem using our synthetic datasets. We train the SmallConvNet for framewise transcription on logarithmically filtered, log-magnitude spectrograms with 229 bins, as proposed in [7]. The output size of the network is limited to 23 notes, and it has only 5327 parameters, to make it approximately comparable to NMF with a dictionary matrix W ∈ R^{229×23} having 5267 parameters. We note that overfitting, i.e. fitting noise in the data, is not the real problem here, as the acoustic properties of the sound sources are the same for train and test set.
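For reference, a rough sketch of such an input representation is given below. All parameters (sample rate, FFT size, hop size, frequency range) are assumptions for illustration, and FFT bins are simply pooled into logarithmically spaced bands rather than weighted with a proper triangular filterbank as in [7]:

# Crude logarithmically filtered, log-magnitude spectrogram
import numpy as np
from scipy.signal import stft

def log_filtered_spectrogram(audio, sr=44100, n_fft=4096, hop=441,
                             n_bands=229, fmin=30.0, fmax=8000.0):
    freqs, _, Z = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(Z)                               # (n_fft // 2 + 1, n_frames)
    edges = np.geomspace(fmin, fmax, n_bands + 1) # log-spaced band edges
    out = np.zeros((n_bands, mag.shape[1]))
    for b in range(n_bands):
        sel = (freqs >= edges[b]) & (freqs < edges[b + 1])
        if sel.any():                             # very narrow low bands may stay empty here
            out[b] = mag[sel].sum(axis=0)
    return np.log1p(out)                          # log-magnitude compression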
The general idea of this experiment is to discover to what extent the network is capable of detecting isolated notes if all it has ever seen were combinations, and vice versa. We can find a partial answer to this question in figures 1 and 2. The figures all show the proportion of frames in which all notes have been exactly identified, and contrast it with the proportions of frames in which notes have been added or omitted. This means that the three quantities do not necessarily sum to one, because in a frame some notes could have been added and some others omitted.
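The three quantities can be computed from binary piano-roll matrices as in the following sketch (the function name and conventions are ours); as noted above, "added" and "omitted" can both hold in the same frame, which is why the fractions need not sum to one:

# Per-frame exact / added / omitted fractions
import numpy as np

def frame_error_fractions(pred, target):
    # pred, target: boolean arrays of shape (n_frames, K)
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    exact = np.all(pred == target, axis=1)    # every pitch correct in the frame
    added = np.any(pred & ~target, axis=1)    # at least one spurious pitch
    omitted = np.any(~pred & target, axis=1)  # at least one missing pitch
    return exact.mean(), added.mean(), omitted.mean()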
In figure 1 we can observe that after seeing only two-note intervals, the network is able to generalize to isolated notes to some extent. While a surprising number of individual notes are transcribed perfectly, some notes are still not recognized properly. For these notes, their companion notes from the train set are predicted as sounding simultaneously, indicating a failure to disentangle note combinations during training.

[Figure 1 about here: bar plot titled "23 isolated notes", showing per-note fractions of exact, added and omitted frames for the pitches C#3 to B4.]

Figure 1: For isolated notes present only in the test set, this is the proportion of exactly transcribed frames, along with the proportions of frames that had notes added or omitted, respectively. Transcriptions stem from the SmallConvNet trained on FLUID-COMBI.
[Figure 2 about here: bar plot titled "23 intervals", showing fractions of exact, added and omitted frames for the 23 best transcribed two-note intervals.]

Figure 2: For a selection of the 23 best transcribed intervals present only in the test set, this is the proportion of exactly transcribed frames, along with the proportions of frames that had notes added or omitted, respectively. Transcriptions stem from the SmallConvNet trained on FLUID-ISOL.

In figure 2 we see that the network utterly fails to generalize from isolated notes to note combinations, with only two exceptions. We plotted only the 23 best transcribed note combinations, as for the remaining 230 intervals the proportion of omission errors is very close to, or even exactly, 1.0. The network does manage to transcribe two of the intervals with acceptable accuracy; however, an explanation of why exactly these two intervals could be recognized eludes us at the moment.

We might draw a preliminary conclusion from these results: the strategy most successful for alleviating the disentanglement problem for NMF, namely learning the dictionary W from isolated notes, does not work for neural transcription systems. The NMF of spectrograms is a linear system, and therefore has the superposition property: its response to multiple inputs is the sum of the responses to the individual inputs. This is not necessarily true for neural networks, as they may learn to approximate a linear function, but do not have to.

The other strategy mentioned in [15], namely showing combinations of notes to the networks, seems to work fairly well for the majority of isolated notes, as can be observed in figure 1. Unfortunately, the number of combinations for the tonal range of the piano grows large very quickly. Even when assuming a maximum polyphony of only 6, we would already need to show \sum_{i=2}^{6} \binom{88}{i} = 583,552,442 combinations to the network (even with polyphony capped at 5, the count is still 41,621,206).
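The count is easily verified (and contrasted with the polyphony-5 case) with a few lines:

# Sanity check of the combinatorial counts above
from math import comb

print(sum(comb(88, i) for i in range(2, 6)))   # polyphony 2..5 -> 41621206
print(sum(comb(88, i) for i in range(2, 7)))   # polyphony 2..6 -> 583552442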
4.2 ConvNet, AUNet and MAPS-MUS

We now turn our attention to a more musically relevant dataset. We train several instances of both the ConvNet and the AUNet, closely adhering to the training procedure described in [7], and select for analysis the model that achieves the highest framewise f-measure on the validation set. Our analysis of error behavior is restricted to the validation set as well, simply because we want to avoid learning too much about the composition of the MAPS test set. The scenario is the same, as the validation set consists of pieces rendered by an unseen synthesizer. We feel that this also lends some additional strength to our argument, as we conduct our analysis on the best performing model for this set.

Two different scenarios are considered. The first scenario looks at the transcription results for notes and note combinations that are present in both the train and validation set, referred to as "shared" combinations. A low proportion of additions will tell us that there were a sufficient number of examples for this particular combination, so it could not be overshadowed by combinations containing additional notes. A high proportion of omissions will indicate issues with generalization to different acoustic properties. If both proportions are high, this indicates that one or more notes in the combination have been mistaken for others.

The second scenario examines the transcription results for notes and note combinations that are present only in the validation set, referred to as "unshared". If the proportion of exactly transcribed frames is high, the network must have learned to disentangle individual notes from the different combinations shown to it, and be able to recognize these disentangled parts in new, unseen combinations. A high proportion of additions will mainly tell us that the network has failed to disentangle parts, but still tries to combine the ones it knows about. A high proportion of omissions points to either a failure to simultaneously disentangle and recombine, a failure to generalize to different acoustic properties, or, more probably, both.
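Operationally, the "shared"/"unshared" partition can be obtained from the per-frame sets of active pitches of the two splits, as in the following sketch (the representation and names are our own choices):

# Partitioning note combinations into shared and unshared sets
def active_sets(piano_roll):
    # piano_roll: iterable of per-frame collections of active MIDI pitches
    return {frozenset(frame) for frame in piano_roll if len(frame) > 0}

def shared_unshared(train_rolls, valid_rolls):
    train_sets = set().union(*(active_sets(r) for r in train_rolls))
    valid_sets = set().union(*(active_sets(r) for r in valid_rolls))
    shared = valid_sets & train_sets      # combinations seen during training
    unshared = valid_sets - train_sets    # only ever seen at validation time
    return shared, unshared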
[Figure 3 about here: bar plot titled "20 most common, shared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 3: For the most common note combinations present both in the train set and validation set, this is the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the ConvNet trained on MAPS-MUS.

[Figure 4 about here: bar plot titled "20 most common, unshared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 4: The most common note combinations present only in the validation set, and the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the ConvNet trained on MAPS-MUS.

In figure 3 we can see two things: the most common note combinations present in both train and validation set are actually isolated notes, and the relative frequency of exactly transcribed notes is comparatively high. Unfortunately, we can also see that the proportion of frames in which additional notes were erroneously transcribed is much higher than we would prefer, pointing both to a lack of examples for these individual notes at train time and to the failure to generalize from combinations. They all are confused with combinations every so often. The low proportion of omission errors for isolated notes indicates only mild difficulties in generalizing to different acoustical properties.

Looking at figure 4, we can see the error behavior of the network for the most common note combinations that are only present in the validation set. We notice a large number of omission errors, which also indicates a failure to generalize to unseen note combinations. A few combinations, such as (G3, A3, C4, D4), stand out though as being transcribed with great accuracy. We could find no satisfactory explanation for this so far, other than the suspicion that it has to do with their low polyphony.

If we compare the results of the ConvNet (figure 3) and the AUNet (figure 5) for the most common note combinations which are shared by the train and validation set, we can observe that the AUNet achieves marginally better exact transcription results across the board. In some cases, the proportion of added notes is reduced; however, this happens at the expense of a slightly increased proportion of omitted note combinations. Likewise, the results for the ConvNet (figure 4) and the AUNet (figure 6) for the "unshared" case appear to be very similar, indicating a comparable error behavior across very different architectures.

Concluding this section, we would like to emphasize that both architectures match (or even slightly exceed) the framewise transcription results on the MAPS dataset reported in [7], which currently defines the state of the art. In other words, it is unlikely that the problematic results reported above are due to poor hyperparameter choices.
[Figure 5 about here: bar plot titled "20 most common, shared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 5: The most common note combinations present both in the train set and validation set, and the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the AUNet trained on MAPS-MUS.

[Figure 6 about here: bar plot titled "20 most common, unshared note combinations", showing per-combination fractions of exact, added and omitted frames.]

Figure 6: The most common note combinations present only in the validation set, and the proportion of exactly transcribed frames, along with the proportion of frames that had notes added or omitted, respectively. Transcriptions stem from the AUNet trained on MAPS-MUS.

5 Summary

We have experimentally shown that certain neural network architectures have difficulties disentangling inputs which are superpositions or mixtures of individual parts, as discussed in section 4.2. They learn to do so only if they are shown a large number of combinations whose constituent parts overlap, and they utterly fail to generalize to combinations when trained on individual parts of the mixture alone, as we determined in a small experiment described in section 4.1. Any approach that tries to learn from a fixed set of combinations, for example defined by a set of music pieces, without incorporating additional constraints or prior knowledge, as is done in [13, 14, 7], will suffer from this problem.

The brute-force approach to solving the disentanglement problem would be to show all possible combinations to the network. Unfortunately this solution is intractable, due to the large tonal range and maximum polyphony of certain instruments. Arguably this approach would also not necessarily force the networks to learn how to disentangle, as they could, in principle, simply memorize all combinations. Learning a different note detector for each note, as done in [9, 10], suffers from the same problem if the combinations shown to each detector are not diverse enough. Depending on the expressiveness of the model class, "diverse enough" could easily mean "all combinations".

A partial solution to this problem might involve a modification of the loss function for the network. An additional objective must specify the need to disentangle individual notes explicitly. The network needs to learn to decompose a (nonlinear) mixture of signals into its constituent parts, a task commonly known as "source separation". Finding ways of formulating such an additional objective for networks trained on multi-label classification tasks such as framewise transcription is the subject of our ongoing research.
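Purely as an illustration of what such an additional objective could look like (nothing in this sketch is taken from the paper or from existing work; it merely combines the usual framewise binary cross-entropy with a hypothetical source-separation-style reconstruction term):

# Hypothetical multi-task loss: transcription + per-note reconstruction
import torch
import torch.nn.functional as F

def combined_loss(note_logits, note_targets, per_note_spectra, input_spec,
                  weight=0.1):
    # note_logits, note_targets: (batch, K); note_targets is a float 0/1 tensor
    # per_note_spectra: (batch, K, D) hypothetical per-note spectral estimates
    # input_spec: (batch, D) observed magnitude spectrogram frames
    bce = F.binary_cross_entropy_with_logits(note_logits, note_targets)
    # auxiliary objective: spectra of active notes should add up to the input
    recon = (per_note_spectra * note_targets.unsqueeze(-1)).sum(dim=1)
    aux = F.mse_loss(recon, input_spec)
    return bce + weight * aux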
Acknowledgements

This work is supported by the European Research Council (ERC Grant Agreement 670035, project CON ESPRESSIONE). The Tesla K40 used for this research was donated by the NVIDIA Corporation.

References

[1] Emmanouil Benetos, Sebastian Ewert, and Tillman Weyde. Automatic transcription of pitched and unpitched sounds from polyphonic music. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 3107–3111, 2014.
[2] Nancy Bertin, Roland Badeau, and Gaël Richard. Blind signal decompositions for automatic transcription of polyphonic music: NMF and K-SVD on the benchmark. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, Honolulu, Hawaii, USA, April 15-20, 2007, pages 65–68, 2007.

[3] Nancy Bertin, Roland Badeau, and Emmanuel Vincent. Fast Bayesian NMF algorithms enforcing harmonicity and temporal continuity in polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA '09, New Paltz, NY, USA, October 18-21, 2009, pages 29–32, 2009.

[4] Arnaud Dessein, Arshia Cont, and Guillaume Lemaitre. Real-time polyphonic music transcription with non-negative matrix factorization and beta-divergence. In Proceedings of the 11th International Society for Music Information Retrieval Conference, ISMIR 2010, Utrecht, Netherlands, August 9-13, 2010, pages 489–494, 2010.

[5] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio, Speech & Language Processing, 18(6):1643–1654, 2010.

[6] Graham Grindlay and Daniel P. W. Ellis. Multi-voice polyphonic music transcription using eigeninstruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA '09, New Paltz, NY, USA, October 18-21, 2009, pages 53–56, 2009.

[7] Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer. On the potential of simple framewise approaches to piano transcription. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City, United States, August 7-11, 2016, pages 475–481, 2016.

[8] Anis Khlif and Vidhyasaharan Sethu. An iterative multi range non-negative matrix factorization algorithm for polyphonic music transcription. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Málaga, Spain, October 26-30, 2015, pages 330–335, 2015.

[9] Matija Marolt. A connectionist approach to automatic transcription of polyphonic piano music. IEEE Trans. Multimedia, 6(3):439–449, 2004.

[10] Juhan Nam, Jiquan Ngiam, Honglak Lee, and Malcolm Slaney. A classification-based polyphonic piano transcription approach using learned feature representations. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011, pages 175–180, 2011.

[11] Ken O'Hanlon and Mark D. Plumbley. Polyphonic piano transcription using non-negative matrix factorisation with group sparsity. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 3112–3116, 2014.

[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241, 2015.

[13] Siddharth Sigtia, Emmanouil Benetos, Nicolas Boulanger-Lewandowski, Tillman Weyde, Artur S. d'Avila Garcez, and Simon Dixon. A hybrid recurrent neural network for music transcription. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 2061–2065, 2015.
[14] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Trans. Audio, Speech & Language Processing, 24(5):927–939, 2016.

[15] Paris Smaragdis and Judith C. Brown. Non-negative matrix factorization for polyphonic music transcription. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No. 03TH8684), pages 177–180. IEEE, 2003.

[16] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech & Language Processing, 18(3):528–537, 2010.

[17] Felix Weninger, Christian Kirst, Björn W. Schuller, and Hans-Joachim Bungartz. A discriminative approach to polyphonic piano note transcription using supervised non-negative matrix factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6–10, 2013.

Appendix

[Figure 7 about here: schematic drawings of (a) the ConvNet and (b) the SmallConvNet, with per-layer feature map sizes.]

Figure 7: The ConvNet and SmallConvNet architectures. White boxes denote convolutional layers, black thick lines stand for fully connected layers. Arrows show information flow, dashed lines indicate a max-pooling operation.

Layer Type          Output Dimensions    No. of Params
Input               1x5x229
Conv(Id)            32x5x229@3x3         288
BatchNorm           32x5x229             128
Relu                32x5x229
Conv(Id)            32x3x227@3x3         9216
BatchNorm           32x3x227             128
Relu                32x3x227
MaxPool             32x3x113@1x2
Dropout, p=0.25     32x3x113
Conv(Id)            64x1x111@3x3         18432
BatchNorm           64x1x111             256
Relu                64x1x111
MaxPool             64x1x55@1x2
Dropout, p=0.25     64x1x55
Dense(Id)           512                  1802240
BatchNorm           512                  2048
Relu                512
Dropout, p=0.5      512
Dense(Sigmoid)      88                   45144
Total                                    1877880

Table 1: The ConvNet Architecture
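For convenience, a PyTorch sketch of the layer stack in table 1 is given below. Bias-free convolutions and 'valid' padding for the second and third convolution reproduce the listed output shapes and parameter counts; these details are inferred from the table rather than taken from a published implementation.

# ConvNet stack inferred from table 1
import torch.nn as nn

conv_net = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1, bias=False),   # 1x5x229 -> 32x5x229, 288 params
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, bias=False),             # -> 32x3x227, 9216 params
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d((1, 2)),                         # -> 32x3x113
    nn.Dropout(0.25),
    nn.Conv2d(32, 64, 3, bias=False),             # -> 64x1x111, 18432 params
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d((1, 2)),                         # -> 64x1x55
    nn.Dropout(0.25),
    nn.Flatten(),
    nn.Linear(64 * 1 * 55, 512, bias=False),      # 1802240 params
    nn.BatchNorm1d(512), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 88),                           # 45144 params (with bias)
    nn.Sigmoid(),
)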
[Figure 8 about here: schematic drawing of the AUNet, with feature map sizes ranging from 256x256x32 down to 32x32x128 and back up to 256x88x1.]

Figure 8: A schematic drawing of the AUNet architecture. White boxes denote convolutional layers, their width corresponds to the number of convolutional kernels, their height corresponds to the size of the resulting feature maps. Arrows show information flow, dashed lines indicate either a max-pooling operation, if the line goes from a higher to a lower box, or an upscaling operation if the line goes from a lower to a higher box. Grey boxes next to white boxes denote a concatenation of feature maps. We made two adaptations to the original UNet architecture. The first is the use of upscaling operations instead of deconvolutions, and the second adaptation is the last layer having convolutions with a large kernel width in the frequency direction.
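In PyTorch terms, the two adaptations could be written as follows (a sketch inferred from the caption and table 3, not an official implementation):

# The two AUNet adaptations: plain upscaling on the expanding path, and a
# final sigmoid convolution that is wide along the frequency axis.
import torch.nn as nn

upscale = nn.Upsample(scale_factor=2, mode='nearest')  # instead of a deconvolution

final_layer = nn.Sequential(
    nn.Conv2d(32, 1, kernel_size=(1, 41)),              # 32x256x128 -> 1x256x88, 1313 params
    nn.Sigmoid(),
)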
Layer Type          Output Dimensions    No. of Params
Input               1x5x229
Conv(Id)            8x5x229@3x3          72
BatchNorm           8x5x229              32
Relu                8x5x229
Conv(Id)            8x3x227@3x3          576
BatchNorm           8x3x227              32
Relu                8x3x227
MaxPool             8x3x113@1x2
Dropout, p=0.25     8x3x113
Conv(Id)            8x1x111@3x3          576
BatchNorm           8x1x111              32
Relu                8x1x111
MaxPool             8x1x55@1x2
Dropout, p=0.25     8x1x55
Conv(Id)            8x1x53@1x3           192
BatchNorm           8x1x53               32
Relu                8x1x53
MaxPool             8x1x26@1x2
Dropout, p=0.25     8x1x26
Dense(Id)           16                   3328
BatchNorm           16                   64
Relu                16
Dropout, p=0.5      16
Dense(Sigmoid)      23                   391
Total                                    5327

Table 2: The SmallConvNet Architecture

Layer Type          Output Dimensions    No. of Params
Input               1x256x256
Conv(Id)            32x256x256@3x3       288
BatchNorm           32x256x256           128
Elu                 32x256x256
Conv(Id)            32x256x256@3x3       9216
BatchNorm           32x256x256           128
Elu                 32x256x256
MaxPool             32x128x128@2x2
Conv(Id)            32x128x128@3x3       9216
BatchNorm           32x128x128           128
Elu                 32x128x128
Conv(Id)            32x128x128@3x3       9216
BatchNorm           32x128x128           128
Elu                 32x128x128
MaxPool             32x64x64@2x2
Conv(Id)            64x64x64@3x3         18432
BatchNorm           64x64x64             256
Elu                 64x64x64
Conv(Id)            64x64x64@3x3         36864
BatchNorm           64x64x64             256
Elu                 64x64x64
MaxPool             64x32x32@2x2
Conv(Id)            64x32x32@3x3         36864
BatchNorm           64x32x32             256
Elu                 64x32x32
Conv(Id)            64x32x32@3x3         36864
BatchNorm           64x32x32             256
Elu                 64x32x32
MaxPool             64x16x16@2x2
Conv(Id)            128x16x16@3x3        73728
BatchNorm           128x16x16            512
Elu                 128x16x16
Conv(Id)            128x16x16@3x3        147456
BatchNorm           128x16x16            512
Elu                 128x16x16
Upscale             128x32x32
Concat              192x32x32
Conv(Id)            128x32x32@3x3        221184
BatchNorm           128x32x32            512
Elu                 128x32x32
Conv(Id)            128x32x32@3x3        147456
BatchNorm           128x32x32            512
Elu                 128x32x32
Upscale             128x64x64
Concat              192x64x64
Conv(Id)            64x64x64@3x3         110592
BatchNorm           64x64x64             256
Elu                 64x64x64
Conv(Id)            64x64x64@3x3         36864
BatchNorm           64x64x64             256
Elu                 64x64x64
Upscale             64x128x128
Concat              96x128x128
Conv(Id)            32x128x128@3x3       27648
BatchNorm           32x128x128           128
Elu                 32x128x128
Conv(Id)            32x128x128@3x3       9216
BatchNorm           32x128x128           128
Elu                 32x128x128
Upscale             32x256x256
Concat              64x256x256
Conv(Id)            32x256x256@3x3       18432
BatchNorm           32x256x256           128
Elu                 32x256x256
Conv(Id)            32x256x128@3x3       9216
BatchNorm           32x256x128           128
Elu                 32x256x128
Conv(Sigmoid)       1x256x88@1x41        1313
Total                                    964673

Table 3: The AUNet Architecture