2014 IEEE XXXIV International Scientific Conference Electronics and Nanotechnology (ELNANO)
On existence of optimal boundary value between
early reflections and late reverberation
Arkadiy Prodeus, Olga Ladoshko
Acoustic and Electroacoustic Department, Faculty of Electronics, NTUU KPI, Kyiv, Ukraine
[email protected], [email protected]
Abstract—Enhancement of speech distorted by reverberation is a topical problem that has been actively studied over the last decade. However, it is still extremely difficult to find clear recommendations on the choice of the boundary value between early reflections and late reverberation that is optimal in the sense of such criteria as speech recognition accuracy and speech quality. Another problem is obtaining a simple pre-processor for speech dereverberation. Both problems are investigated in this paper.

Keywords—speech enhancement; late reverberation; dereverberation

I. INTRODUCTION

Speech dereverberation in communication and automatic speech recognition (ASR) systems is a topical problem [1-4]. It has been investigated especially actively in the last decade owing to the rapid development of mobile communications. It was found that late reverberation is the main detrimental factor and may be interpreted as a kind of noise. Unfortunately, the strong non-stationarity of late reverberation makes traditional noise suppression techniques ineffective [1], because these techniques are designed for stationary or slowly varying noise.

At the same time, it was found that the late reverberation power spectrum may be estimated relatively easily when Polack's statistical reverberation model is chosen [2]. The estimation formula is simple both to compute and to understand, but it contains the parameter $T_l$, the time corresponding to the boundary between early reflections and late reverberation. The problem is that this boundary is blurred: we find $T_l \approx 30 \ldots 60$ ms in [2] and 40…100 ms in [3]. Moreover, these values were obtained experimentally in studies of speech intelligibility and musical clarity [5], and it is not evident that the same values will be good for speech recognition and communication systems.

The first objective of this paper is to investigate whether optimal values of the parameter $T_l$ exist in the sense of such criteria as speech recognition accuracy and speech quality. The second objective is to evaluate the performance of speech recognition and communication systems when a relatively simple pre-processor of speech dereverberation, proposed by the authors, is used.

II. REVERBERATION MODEL

The reverberant signal $y(t)$ results from the convolution of the anechoic speech signal $x(t)$ with the causal time-invariant Acoustic Impulse Response (AIR) $h(t)$:

$$y(t) = \int_0^{\infty} h(v)\,x(t-v)\,dv = x(t) \otimes h(t),$$

where $\otimes$ denotes convolution.

If the regions of the AIR $h(t)$ (Fig. 1) corresponding to early reflections and late reverberation are separated as

$$h_i(t) = \begin{cases} h(t), & 0 \le t \le T_l; \\ 0, & \text{otherwise}, \end{cases} \qquad h_l(t) = \begin{cases} h(t+T_l), & t \ge 0; \\ 0, & \text{otherwise}, \end{cases}$$

the action of reverberation can be described as

$$y(t) = h_i(t) \otimes x(t) + h_l(t) \otimes x(t-T_l) = h_i(t) \otimes x(t) + r(t), \qquad (1)$$

where $r(t)$ is the component due to late reverberation and $T_l$ is the time corresponding to the boundary between early reflections and late reverberation (see Fig. 1).

Comparing model (1) with the additive noise model

$$y(t) = x(t) + n(t),$$

where $n(t)$ is a stationary stochastic process, makes it clear why late reverberation may be interpreted as a kind of noise. Unfortunately, the strong non-stationarity of late reverberation makes traditional noise suppression techniques ineffective [1], because these techniques are designed for stationary or slowly varying noise. At the same time, the influence of early reflections, described by the convolution of the signal $x(t)$ with the AIR $h_i(t)$, may be compensated in ASR
978-1-4799-4580-1/14/$31.00 ©2014 IEEE
systems by standard techniques such as, for example, cepstral mean normalization [2].

Fig. 1. Room AIR structure

III. PROPOSED PRE-PROCESSOR OF DEREVERBERATION

Let us show that the late reverberation suppression procedure may be realized by almost the same means that are usually used for noise suppression. The only distinction is that the late reverberation spectrum is estimated instead of the noise spectrum.

Correction in the frequency domain is one of the most widespread approaches to noise suppression [1, 2]:

$$\hat{\lambda}_x^{1/2}(l,k) = G(l,k)\,\lambda_y^{1/2}(l,k), \qquad (2)$$

where $\lambda_y(l,k)$ is the power spectrum of the $l$-th frame of the signal $y(t)$ at frequency $f_k = kF_s/N_{fft}$; $F_s$ is the sampling frequency; $N_{fft}$ is the FFT size; $k$ is the index of the frequency sample; $\hat{\lambda}_x(l,k)$ is the estimate of the power spectrum of the $l$-th frame of the signal $x(t)$ at the $k$-th frequency sample; $G(l,k)$ is the gain of the correction filter for the $l$-th frame of the signal $y(t)$ at the $k$-th frequency sample.

Without loss of generality, let us consider, for definiteness, the logMMSE method [6], for which the enhancement filter gain is

$$G(l,k) = \frac{\xi(l,k)}{1+\xi(l,k)} \exp\left( \frac{1}{2} \int_{v(l,k)}^{\infty} \frac{e^{-t}}{t}\,dt \right), \qquad (3)$$

$$v(l,k) = \frac{\xi(l,k)}{1+\xi(l,k)}\,\gamma(l,k), \qquad (4)$$

where $\xi(l,k) = \lambda_x(l,k)/\lambda_n(l,k)$ is the a priori signal-to-noise ratio (SNR); $\gamma(l,k) = \lambda_y(l,k)/\lambda_n(l,k)$ is the a posteriori SNR; $\lambda_n(l,k)$ is the power spectrum of the $l$-th frame of the noise $n(t)$ at frequency $f_k$.

Two subtasks are fundamentally important and difficult when implementing the logMMSE method:

• estimation of the noise spectrum $\lambda_n(l,k)$;
• estimation of the a priori SNR $\xi(l,k)$.

To modify the noise suppression scheme built in accordance with (2)-(4), we simply substitute the estimate of the late reverberation spectrum $\lambda_r(l,k)$ for the estimate of the noise spectrum $\lambda_n(l,k)$, as shown in Fig. 2 (the corresponding estimation block is outlined in bold).

For distances between the speech source and the microphone that exceed the critical distance $D_c$, the late reverberation power spectrum $\lambda_r(l,k)$ may be calculated from the spectrum $\lambda_y(l,k)$ of the signal $y(t)$ [2]:

$$\lambda_r(l,k) = e^{-2\delta(k)T_l}\,\lambda_y(l-N_l,k), \qquad (5)$$

where $N_l = T_l F_s/R$; $R$ denotes the frame rate, in samples, of the short-time Fourier transform (STFT); $\delta(k) = 2\ln 10/T_{60}(k)$; $T_{60}(k)$ is the reverberation time.

The meaning of (5) is quite simple: the current speech sounds are masked by the preceding speech sounds.

Fig. 2. Proposed pre-processor of dereverberation. Flowchart steps: Begin; method parameters selection; splitting the signal into frames; calculation of the power spectrum in frames; calculation of the late reverberation power spectrum in frames instead of the noise power spectrum; calculation of the enhancement filter gain; calculation of the enhanced signal spectrum; IFFT of the spectrum in frames; merging of frames in the time domain; End.

Smoothing is necessary to improve the accuracy of estimating the spectrum $\lambda_y(l,k)$ [2]:

$$\hat{\lambda}_y(l,k) = \eta_y(k)\,\hat{\lambda}_y(l-1,k) + \bigl(1-\eta_y(k)\bigr)\,|Y(l,k)|^2, \qquad (6)$$

where $Y(l,k)$ is the discrete Fourier transform (DFT) of the $l$-th frame of the signal $y(t)$;

$$\eta_y(k) = \begin{cases} \eta_y^d(k), & |Y(l,k)|^2 \le \hat{\lambda}_y(l-1,k); \\ \eta_y^a(k), & \text{otherwise}. \end{cases} \qquad (7)$$

The upper bound of the constant $\eta_y^d(k)$ ($0 \le \eta_y^d(k) < 1$) is
$$\eta_y^d(k) = \frac{1}{1 + 2\delta(k)R/F_s}, \qquad (8)$$

and the constant $\eta_y^a(k)$ is selected from the condition $0 \le \eta_y^a(k) < \eta_y^d(k)$.

IV. SIMULATION EXAMPLES

The VoiceBox [7] routine "ssubmmse.m", designed for noise reduction, was modified in accordance with the proposals of the previous section. The reverberation time was estimated by applying Schroeder's method [2] to bandpass-filtered versions of the AIRs. In addition, it was taken that $\eta_y^a(k) = 0.5 \cdot \eta_y^d(k)$.

A. Qualitative evaluation of dereverberation performance

A real speech signal was recorded in a room with a volume of 80 m³ and a reverberation time of 1.1 s (the AIR is shown in Fig. 3). The parameters of the digitized sound are: sampling frequency 22050 Hz, 16-bit linear quantization. The distance between the speaker and the microphone was about 2 m, which is much greater than the critical distance $D_c \approx 0.5$ m (the value of $D_c$ is calculated by (3.1) from [2]).

Fig. 3. Room AIR, T20 = 1.1 s

Waveforms of the reverberant and enhanced signals are shown in Fig. 4, and the corresponding spectrograms are shown in Fig. 5. When listened to, the distorted signal sounds strongly reverberant, whereas the reconstructed signal is much less so, i.e. the positive effect of reverberation suppression is evident. However, a slight distortion introduced by the dereverberation procedure is audible ($T_l = 48$ ms was used in the procedure). Increasing $T_l$ to 100 ms led to some improvement in the sound quality of the enhanced signal. This demonstrates the real problem of precisely determining the value of the parameter $T_l$.

Fig. 4. Reverberant (a) and enhanced (b) speech signals

Fig. 5. Spectrograms of reverberant (a) and enhanced (b) signals

B. Quantitative evaluation of dereverberation performance

Quantitative evaluation of dereverberation performance was made by means of objective measures such as ASR accuracy and speech quality. ASR accuracy was assessed using the indicator

$$\mathrm{Acc}\% = \frac{N - D - S - I}{N} \times 100\%,$$

where $N$ is the total number of labels in the reference transcriptions; $D$ is the number of deletion errors; $S$ is the number of substitution errors; $I$ is the number of insertion errors. The PESQ indicator was used for speech quality assessment [8]. Interestingly, this indicator is sometimes also used for speech intelligibility assessment [10].

The HTK toolkit [9] was used for ASR system simulation. The ASR system was trained using 269 samples of 27 words recorded by two female speakers. A sound file of discrete speech (with 0.2…0.5 s pauses) containing all 27 words used in training served as the test signal. The phoneme vocabulary contained 27 phonemes of the Ukrainian language, and 39 MFCC_0_D_A coefficients were used in the ASR simulation.

Table I contains the results of Acc% and PESQ assessment for clear and reverberant signals. Signals distorted by reverberation were simulated as the convolution of the clean speech signal with room AIRs. There were three rooms, with reverberation times of 0.74, 0.89 and 1.10 s. Sounds of a bursting rubber ball were used as the AIRs for these rooms. $T_{20}$ is the reverberation time, in seconds, based on a 20 dB evaluation range [11].

As can be seen from Table I, reverberation significantly affects both Acc% (reduced from 93% to 22…30%) and PESQ (reduced from 4.5 to 2.03…2.28).
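The accuracy indicator Acc% defined in this section can be illustrated with a minimal Python sketch. The function name `accuracy_percent` and the error counts below are hypothetical and chosen only for illustration; the only assumption taken from the experiment is a reference transcription of N = 27 labels (one per test word), which is not stated explicitly in the paper.

```python
def accuracy_percent(n_labels, deletions, substitutions, insertions):
    """Accuracy indicator Acc% = (N - D - S - I) / N * 100."""
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    correct = n_labels - deletions - substitutions - insertions
    return 100.0 * correct / n_labels


# Hypothetical error counts for a 27-label reference (illustration only).
acc_few_errors = accuracy_percent(27, deletions=0, substitutions=2, insertions=0)
acc_many_errors = accuracy_percent(27, deletions=3, substitutions=16, insertions=2)

print(f"{acc_few_errors:.2f}")   # 92.59
print(f"{acc_many_errors:.2f}")  # 22.22
```

Because insertion errors are subtracted in the numerator, the indicator can become negative when a recognizer inserts many spurious labels.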
TABLE I. Acc% AND PESQ FOR CLEAR AND REVERBERANT SIGNALS

Signal kind    T20, s    Acc%     PESQ
Clear          0         92.59    4.5
Reverberant    0.74      22.22    2.281
Reverberant    0.89      22.22    2.073
Reverberant    1.10      29.63    2.030
Results of Acc% and PESQ estimation for the enhanced speech signals are shown in Table II and Fig. 6-7.

As can be seen, enhancement by method 1 (the "classic" logMMSE method intended for noise suppression) did not lead to positive results.

Meanwhile, enhancement by method 2 (the proposed method) made it possible to increase the Acc% value significantly (from 22…30% to 56…74%). Interestingly, the PESQ value did not rise as much (it increased from 2.281 to 2.33 for $T_{20} = 0.74$ s, and only from 2.073 to 2.08 for $T_{20} = 0.89$ s).

Fig. 7. Speech quality in the absence and presence of enhancement

Results of experimental studies of the dependencies Acc%($T_l$) and PESQ($T_l$) are shown in Table III and Fig. 8-9.

Fig. 8. Acc%($T_l$) dependency

It follows from these results that the value of $T_l$ that is optimal in the sense of maximum Acc% lies in the interval 100…200 ms. The situation with the PESQ($T_l$) dependency is less certain. In two of the three cases the speech quality decreases with increasing $T_l$, and in only one case was a weakly pronounced maximum observed at $T_l \approx 200 \ldots 240$ ms.
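The position of the Acc% maximum can be read directly from the Table III values. The following Python sketch does this for each room; the names `ACC_BY_ROOM` and `best_boundaries` are illustrative, and ties are reported as a list rather than resolved.

```python
# Acc% values transcribed from Table III, keyed by T20 (s) and then by Tl (ms).
ACC_BY_ROOM = {
    0.74: {48: 66.7, 96: 74.1, 144: 70.4, 192: 59.3, 240: 48.2, 288: 44.4},
    0.89: {48: 51.9, 96: 51.9, 144: 51.9, 192: 55.6, 240: 44.4, 288: 33.3},
    1.10: {48: 62.3, 96: 62.3, 144: 55.6, 192: 51.9, 240: 48.2, 288: 44.4},
}


def best_boundaries(acc_by_tl):
    """Return (max Acc%, sorted list of Tl values in ms that reach it)."""
    best = max(acc_by_tl.values())
    return best, sorted(tl for tl, acc in acc_by_tl.items() if acc == best)


best_tl = {t20: best_boundaries(row) for t20, row in ACC_BY_ROOM.items()}
for t20, (acc, tls) in best_tl.items():
    print(f"T20 = {t20:.2f} s: max Acc% = {acc} at Tl = {tls} ms")
```

On the 48 ms grid of Table III, the maxima fall at 96 ms for T20 = 0.74 s, at 192 ms for T20 = 0.89 s, and at 48 and 96 ms (a tie) for T20 = 1.10 s.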
TABLE II. Acc% AND PESQ FOR ENHANCED SIGNALS

T20, s    Acc%, method 1    Acc%, method 2    PESQ, method 1    PESQ, method 2
0.74      18.52             74.1              2.252             2.33
0.89      14.81             55.6              2.059             2.08
1.10      29.63             62.3              2.037             2.23

Fig. 6. Recognition accuracy with and without speech enhancement

Fig. 9. PESQ(Tl) dependency

TABLE III. Acc% AND PESQ FOR DIFFERENT Tl

T20, s    Tl, ms    Acc%    PESQ
0.74      48        66.7    2.33
0.74      96        74.1    2.30
0.74      144       70.4    2.27
0.74      192       59.3    2.26
0.74      240       48.2    2.27
0.74      288       44.4    2.26
0.89      48        51.9    2.00
0.89      96        51.9    2.06
0.89      144       51.9    2.05
0.89      192       55.6    2.08
0.89      240       44.4    2.08
0.89      288       33.3    2.07
1.10      48        62.3    2.23
1.10      96        62.3    2.19
1.10      144       55.6    2.16
1.10      192       51.9    2.07
1.10      240       48.2    2.03
1.10      288       44.4    2.01
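As a compact summary of Tables I and II, the Acc% and PESQ values can be averaged over the three rooms. The Python sketch below is only an illustration of that arithmetic; the dictionary layout and the name `room_average` are assumptions made here, not part of the original experiment.

```python
from statistics import mean

# Values transcribed from Tables I and II, ordered by T20 = 0.74, 0.89, 1.10 s.
ACC = {
    "reverberant": [22.22, 22.22, 29.63],
    "method 1": [18.52, 14.81, 29.63],
    "method 2": [74.1, 55.6, 62.3],
}
PESQ = {
    "reverberant": [2.281, 2.073, 2.030],
    "method 1": [2.252, 2.059, 2.037],
    "method 2": [2.33, 2.08, 2.23],
}


def room_average(table):
    """Average each signal kind over the three rooms."""
    return {kind: mean(values) for kind, values in table.items()}


mean_acc = room_average(ACC)
mean_pesq = room_average(PESQ)
for kind in ACC:
    print(f"{kind}: Acc% = {mean_acc[kind]:.1f}, PESQ = {mean_pesq[kind]:.2f}")
```

Averaged this way, the reverberant signals give about 24.7% and 2.13, method 1 gives about 21.0% and 2.12, and method 2 gives about 64.0% and 2.21.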
V. DISCUSSION

As can be seen from the experimental results (for reverberation times in the range 0.7…1.1 s, which are typical for laboratories, classrooms and lecture halls), reverberation can significantly reduce ASR accuracy and speech quality. In particular, the Acc% value decreased from 93% to 22…30% and the PESQ value decreased from 4.5 to 2.0…2.3.

Direct application of the logMMSE method to reverberant speech signals does not increase the recognition accuracy or the speech quality even to a small degree. The proposed modification of the logMMSE method improved Acc% significantly, from 22…30% up to 56…74%, and the PESQ values also increased, from 2.13 to 2.21 on average over the three rooms.

The obtained results are preliminary in nature, because the training and test sets were small and only the logMMSE method was used out of a wide set of speech enhancement methods. It is natural to expect that similar conclusions will be valid for other methods such as, for example, spectral subtraction and MMSE [1].

In this paper, the reverberation time was estimated from the available AIRs. In many cases, blind reverberation time estimation is necessary. It is natural to predict that blind estimation of the reverberation time will lead to deterioration of signal quality and ASR accuracy. Assessing the extent of this deterioration can be the subject of future work.

VI. CONCLUSIONS

Experimental studies of the dependencies Acc%($T_l$) and PESQ($T_l$) were conducted. It was shown that the value of $T_l$ that is optimal in the sense of maximum Acc% lies in the interval 100…200 ms. The situation with the PESQ($T_l$) dependency is less certain: in two of the three cases the speech quality decreased with increasing $T_l$, and in only one case was a weakly pronounced maximum observed at $T_l \approx 200 \ldots 240$ ms.

The performance of speech recognition and communication systems using a relatively simple pre-processor of speech dereverberation, proposed by the authors, was evaluated. The proposed method consists in modifying the existing logMMSE method so that a late reverberation spectrum estimator is used instead of the noise spectrum estimator. The validity of the proposal was verified experimentally: the Acc% value rose from 25% to 64% on average, and the PESQ value also increased, though much less.

REFERENCES

[1] I. Cohen, J. Benesty, and S. Gannot (Eds.), Speech Processing in Modern Communication: Challenges and Perspectives. Jan. 2010, 342 p.
[2] P. Naylor and N. Gaubitch, Speech Dereverberation. Springer, 2010, 399 p.
[3] E.A.P. Habets, Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement, PhD dissertation, Eindhoven, 2007, 257 p.
[4] T. Yoshioka et al., "Making Machines Understand Us in Reverberant Rooms," IEEE Signal Processing Magazine, pp. 114-126, Nov. 2012.
[5] J.S. Bradley, "The Evolution of Newer Auditorium Acoustics Measures," Canadian Acoustics, 18(4), pp. 13-23, 1990.
[6] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985.
[7] VOICEBOX: Speech Processing Toolbox for MATLAB [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/
[8] P. Loizou, Speech Enhancement: Theory and Practice. Boca Raton: CRC Press, 2007, 632 p.
[9] S. Young et al., The HTK Book. Cambridge University Engineering Department, 2005, 354 p. [Online]. Available: http://htk.eng.cam.ac.uk/download.shtml
[10] J. Beerends, E. Larsen, N. Iyer, and J. van Vugt, "Measurement of speech intelligibility based on the PESQ approach," in Proc. Int. Conf. Meas. Speech Audio Quality Netw., 2004, 4 p.
[11] ISO 3382-1:2009. Acoustics. Measurement of room acoustic parameters. Part 1. Performance spaces. ISO, 2009, 26 p.