Multi-Condition Training for Unknown Environment Adaptation in Robust ASR Under Real Conditions

Automatic speech recognition (ASR) systems frequently operate in noisy environments. As they are often trained on clean speech data, noise reduction or adaptation techniques are applied to decrease the influence of background disturbance, even in the case of unknown conditions. Speech data mixed with noise recordings from a particular environment is often used for model adaptation. This paper analyses the improvement in recognition performance within such adaptation when multi-condition training data from a real environment is used for training the initial models. Although the quality of such models can decrease due to the presence of noise in the training material, they are assumed to include initial information about the noise and consequently to support the adaptation procedure. Experimental results show a significant improvement from the proposed training method in a robust ASR task under unknown noisy conditions. Word error rate decreased by 29 % and 14 % in comparison with clean speech training data for the non-adapted and the adapted system, respectively.


Introduction
Automatic Speech Recognition (ASR) in a noisy environment has been a challenging issue for many research centers in recent decades, as the presence of noise significantly decreases the accuracy of ASR systems. There are several approaches to compensate for the effect of unclean conditions, which can be combined with more or less advantageous results.
The first class of these methods is applied before acoustic modelling, in front-end signal preprocessing. The signal is typically represented by auditory-based features such as PLPs [1] or MFCCs to minimize the effect of speaker variability. Noise suppression methods, such as the widely used Spectral Subtraction (SS) [2], Wiener filtering, and Minimum Mean Square Error (MMSE) estimation [3], are then applied within front-end signal processing to minimize the background noise level in the analyzed speech.
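For reference, the standard textbook form of the Wiener filter gain applied per frequency bin (not taken from this paper) is

\[ H(f) = \frac{\hat{S}_{ss}(f)}{\hat{S}_{ss}(f) + \hat{S}_{nn}(f)}, \]

where $\hat{S}_{ss}(f)$ and $\hat{S}_{nn}(f)$ are estimates of the speech and noise power spectral densities; the enhanced spectrum is obtained by multiplying the noisy spectrum by $H(f)$.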
The second class involves approaches that take effect in the modelling phase. The models of speech and pause are typically trained on clean speech data to ensure high quality of the final speech models. Model adaptation then transforms the clean speech models to perform well in a noisy environment. Several adaptation techniques use background noise, which is combined with the speech signal, e.g. in multi-environment models [4], or with the acoustic models, as in parallel model combination (PMC) [5]. Other techniques use noisy speech data to adapt the acoustic models to particular background conditions, either by simply retraining the clean speech models or by some transformation using Maximum Likelihood Linear Regression (MLLR) [6] or Maximum A Posteriori (MAP) adaptation [7]. The latter two schemes are also used for speaker adaptation with only a small proportion of adaptation material.
Due to varying or unknown target background conditions, and due to the high cost of collecting speech data in a real environment, sufficient data matching the recognition conditions is typically not available for the adaptation procedure. Therefore a set of data for "almost matched" conditions is often used for training or model adaptation [4,8].
In [9], clean speech was mixed additively with real noise from a car to obtain adaptation data. The final models were then adapted on these recordings by MLLR and MAP, with a resulting improvement from 14.38 % to 5.73 %. The authors show the advantage of using noisy data for training speech models in a car environment using additive mixing of clean speech and noise. Similarly, an additive noise approach outperformed the recognition results of a baseline system in different environmental conditions trained and tested on the Aurora 2 database in [10].
Unlike approaches using only additive noise, data recorded in real conditions is used in this paper. The aim of our work is to analyse the influence of using multi-environment training material for robust speech recognition in an unknown environment. Recordings from the real world are important because they reflect the true influence of noisy conditions: not only additive distortion but also convolutional distortion is taken into account.
As shown e.g. in [11], joint usage of spectral subtraction and MLLR adaptation seems to be a good framework for a recognition task under conditions with a high level of background noise. These techniques can be used for blind adaptation, and they are therefore also useful for unknown noise reduction. This paper describes the effect of multi-condition training in several phases of the noise reduction algorithm shown in Fig. 1. The spectral subtraction technique is used as standard in front-end processing as a blind noise suppression method (see Fig. 1). Model parameters can subsequently be changed using single-pass retraining, as a simple approach for offline adaptation, or MLLR, as a standard method which can be used for both offline and online adaptation.

Spectral subtraction
Spectral Subtraction (SS) is a technique frequently used for suppressing the additive background noise component in the spectral domain, eliminating stationary or non-stationary noise with rather slow changes in its characteristics. The characteristics of the noise are estimated from speech pauses found e.g. by a Voice Activity Detector (VAD), which can often be the limiting point of the algorithm. In our work, extended Spectral Subtraction (ESS) [12] is used; it employs modified adaptive Wiener filtering and works without a VAD.
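For illustration, the following is a minimal sketch of classical magnitude-domain spectral subtraction, assuming the noise spectrum can be estimated from the first few (noise-only) frames. The ESS of [12] differs in that it adapts the noise estimate continuously via Wiener filtering and needs no VAD; all parameter names below are illustrative.

```python
import numpy as np

def spectral_subtraction(x, frame_len=512, hop=256, noise_frames=10,
                         alpha=1.0, floor=0.01):
    """Magnitude spectral subtraction with overlap-add resynthesis.

    The noise magnitude spectrum is estimated from the first
    `noise_frames` frames, i.e. the recording is assumed to begin
    with noise only (a VAD-free simplification; ESS [12] instead
    tracks the noise estimate adaptively).
    """
    window = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // hop + 1

    # Noise magnitude estimate from the leading frames.
    noise_mag = np.zeros(frame_len // 2 + 1)
    n_est = max(min(noise_frames, n_frames), 1)
    for i in range(n_est):
        noise_mag += np.abs(np.fft.rfft(x[i * hop:i * hop + frame_len] * window))
    noise_mag /= n_est

    y = np.zeros(len(x))
    for i in range(n_frames):
        s = i * hop
        spec = np.fft.rfft(x[s:s + frame_len] * window)
        mag, phase = np.abs(spec), np.angle(spec)
        # Over-subtract by factor `alpha`; the spectral floor avoids
        # negative magnitudes and limits musical noise.
        clean = np.maximum(mag - alpha * noise_mag, floor * mag)
        y[s:s + frame_len] += np.fft.irfft(clean * np.exp(1j * phase))
    return y
```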

Single-pass retraining
Single-pass retraining of the models is often used when a large amount of data with matching environmental conditions is available and offline retraining of the acoustic models can be performed. The parameters of the clean speech models are changed within one pass of the retraining procedure, performed on a set of recordings with a matching environmental background. Such data will be called matching data in the following text. The disadvantage of this approach lies in the need for a sufficient amount of matching data for each model, which can be very difficult to obtain, mainly in the case of a specific environment. In addition, a large number of speakers is needed for speaker-independent recognition tasks. For these reasons, single-pass retraining was used as a lower-bound result for unsupervised speaker adaptation experiments with an increasing amount of adaptation data in [6].
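As a sketch of the underlying update (standard HMM re-estimation, not specific to this paper): with state-component occupation probabilities $\gamma_{jm}(t)$ computed using the initial models, the Gaussian parameters are re-estimated from the matching data $o_t$ in a single pass:

\[ \hat{\mu}_{jm} = \frac{\sum_t \gamma_{jm}(t)\, o_t}{\sum_t \gamma_{jm}(t)}, \qquad \hat{\Sigma}_{jm} = \frac{\sum_t \gamma_{jm}(t)\,(o_t - \hat{\mu}_{jm})(o_t - \hat{\mu}_{jm})^{\top}}{\sum_t \gamma_{jm}(t)}. \]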

MLLR
As noted above, there is often not enough data available for the single-pass retraining procedure. MLLR uses a small amount of adaptation material to estimate an affine linear transform (A, b) of the model parameters, which is found by maximizing the likelihood of the adaptation data. Based on our preliminary tests, we use MLLR of the mean vectors only in our experiments; the other model parameters remain unchanged. The new mean vector is then given by

\[ \hat{\mu} = A\mu + b. \]

The same transform (A, b) can be applied to the mean vectors of all models (global adaptation), or the models can be clustered on the basis of acoustic similarity into several classes, with separate transforms applied to the particular classes. This clustering can represent the different effect of background distortion on particular speech phones. The regression class approach also enables us to cluster the models according to the amount of adaptation data, to ensure sufficient quality of the transform. Binary regression class tree clustering [13] is used in this work.
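For illustration, the following is a minimal sketch of estimating a single (global) MLLR mean transform for diagonal-covariance Gaussians, following the row-wise closed-form solution of [6]; the array names are illustrative. A regression-class variant would simply estimate one such W per class from the statistics of the Gaussians assigned to that class.

```python
import numpy as np

def estimate_global_mllr_mean_transform(means, variances, gamma_sums, obs_sums):
    """Estimate W = [b | A] so that mu_hat = W @ [1, mu]^T maximizes the
    likelihood of the adaptation data (diagonal-covariance Gaussians).

    means:      (M, d) component mean vectors mu_m
    variances:  (M, d) diagonal covariances sigma^2_m
    gamma_sums: (M,)   occupation counts   sum_t gamma_m(t)
    obs_sums:   (M, d) weighted data sums  sum_t gamma_m(t) * o_t
    """
    M, d = means.shape
    xi = np.hstack([np.ones((M, 1)), means])   # extended means [1, mu_m]
    W = np.zeros((d, d + 1))
    for i in range(d):                         # closed-form row-wise solve
        G_i = (xi * (gamma_sums / variances[:, i])[:, None]).T @ xi
        k_i = (xi * (obs_sums[:, i] / variances[:, i])[:, None]).sum(axis=0)
        W[i] = np.linalg.solve(G_i, k_i)
    return W

# Applying the transform to one component's mean:
#   mu_hat = W @ np.concatenate(([1.0], mu))
```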

Experiments
The experiments were performed on a small-vocabulary speaker-independent (SI) speech recognition task. A Czech digit sequence recogniser based on HMMs of monophones was used for this purpose.

Front-end setup
Front-end signal processing was carried out using the CtuCopy parametrization tool [14]. This provides similar functionality to the HTK HCopy tool [13] together with additional noise reduction algorithms, e.g. voice activity detection, spectral subtraction and LDA RASTA-like filtering.
Table 1 summarizes the overall setting of the recognition front-end.

Databases
The Czech SPEECON database [15] was used for training and testing, i.e. 16 kHz data recorded in different environments using several types of microphones. The database involves utterances from almost 600 speakers with varied content, e.g. phonetically rich sentences, digits, commands, etc.
Table 2 shows the division of the database according to various environmental conditions. The whole database (ALL) was divided on the basis of the type of recording environment (CLEAN and NOISY) or the estimated SNR level (HISNR and LOSNR). Subsets with specific environments (OFFICE and CAR) were also created.
Each subset was divided into a training part and a testing part, taking into account a sufficient number of speakers for the SI recognition task. Training was performed on head-set microphone (CS0) data. Only the subsets ALL and OFFICE were used for training, to simulate multi-condition and clean-data training, respectively. Data from two different channels, using a head-set microphone (CS0) and a hands-free set (CS1), was used for testing. The CS1 channel is assumed to capture a higher level of background noise, which is illustrated by the estimated SNR values in Table 2. For retraining and adaptation purposes, each testing subset was further divided according to its content into a testing subset, which involves digits only, and a subset with the rest of the material, called the matched set.
As noted in Sec. 2.3, the MLLR adaptation technique can work with a low amount of adaptation data. Subsets containing 20, 50, 100, 200, 500 and 1000 utterances were therefore created from each matched subset for comparison purposes. For the speaker-independent recognition task, each such subset involved as many speakers as possible; not fewer than 18 speakers were present in the final subsets. This number can be considered sufficient with regard to the number of speakers used in [9] (10-80) to obtain an improvement in a speaker-independent task. Table 3 shows the average amount of adaptation data for the different limits on the number of utterances.

Spectral subtraction in different conditions
When the models are trained on clean data, the presence of environmental distortion significantly decreases recognition accuracy. As Table 4 shows, using ESS helps to suppress the influence of unclean conditions. Although the results are worse for matching conditions (Clean, CS0), the overall results give a relative WER improvement of more than 8 %.
A similar improvement was achieved for multi-condition training (Table 5). Although the unclean environment in the training phase decreases the quality of the resulting models, the overall contribution of using the multi-condition training database with ESS against the case of clean training data (Table 6) is a relative WER improvement of almost 30 %.
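For clarity, the relative improvements quoted throughout this paper are understood in the usual sense (our reading; the source does not state the formula explicitly):

\[ \Delta_{rel} = \frac{WER_{ref} - WER}{WER_{ref}} \cdot 100\,\%, \]

where $WER_{ref}$ is the word error rate of the reference system (e.g. clean training without ESS) and $WER$ that of the improved system.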

Single-pass retraining
All matching data for the particular testing subsets was used for single-pass retraining, in the case of both clean and multi-condition training data. With regard to the results in Sec. 3.3, ESS was used within front-end signal processing in the following experiments.
Table 7 shows that using multi-condition training data for training the initial models for single-pass retraining brings an improvement of over 22 % against the clean speech models. All available matching data were used in this experiment, which led to final sets of 2400 (CAR) to 11600 (ALL) utterances for retraining.

MLLR
Single-pass retraining acts as a lower-bound value for environment adaptation, as the amount of data for retraining is rather high. Only a limited amount of adaptation material for particular conditions is available in a real system, and decreasing the proportion of data for the single-pass retraining procedure can lead to a significant decrease in recognition accuracy. MLLR-based adaptation removes this disadvantage. As shown in Fig. 2, even for a low amount of adaptation data the accuracy of an MLLR-adapted system outperforms the baseline and single-pass results. Section 3.2 describes the adaptation subsets with a limited number of utterances, which reduces the computational load of the adaptation procedure. The recognition tests were performed on each subset, and the results presented here show the value averaged over all these limited adaptation subsets. Various settings of model clustering for regression tree-based adaptation according to Sec. 2.3 were used within the experiments. Global transformation and division into 2, 4, 8, 16 and 32 regression classes were used, and the case with the minimum achieved WER is reported in the following table.
The recognition results in Table 8 again show the improvement from using multi-condition training material for the initial models. Only in the case of very clean conditions (Clean, CS0) is the decrease in WER slight. The contribution is evident mainly for channel mismatch (CS1).

Overall improvement
Fig. 3 summarizes the contribution of using multi-condition training data for initial training in the particular phases of the noise reduction procedure. The use of multi-condition training data leads to a significant improvement in all phases of the system.
Overall, the proposed noise reduction method led to a relative WER improvement of 48 %.

Conclusion
The paper shows the advantages of using multi-condition training data for robust ASR in unknown background conditions. The main contribution of the work is in using recordings from a real environment, which reflect the real influence of noise in a robust recognition task.
The results can be summarized in the following points:
• Multi-condition (M-C) training brings a significant improvement in recognition accuracy, even without any other noise reduction method. In the results presented here, multi-condition training outperforms the system that uses spectral subtraction and clean training data by 22 %.
• A combination of M-C training and the spectral subtraction algorithm resulted in a relative WER improvement of more than 29 % against the baseline system. An increase in recognition accuracy by more than 70 % can be observed for data recorded in a car.
• Single-pass retraining gives a robust offline procedure for correcting acoustic models when enough matching data is available, with a high variety of speakers and rich phonetic content. The main contribution was observed for channel mismatch, and the use of M-C trained initial models brought an additional improvement to these results.
• Advantageous clustering of models based on the available adaptation data within MLLR adaptation is shown to bring an improvement over single-pass retraining. The final improvement using M-C trained models only slightly outperforms the single-pass results.
In general, multi-condition training material for the initial training of speech models brings an improvement to the recognition task in unknown environmental conditions. As the training and testing data in our experiments come from the same source, future work will be oriented towards larger mismatches between adaptation and recognition conditions.

Fig. 1: Block scheme of the noise reduction algorithm

Fig. 2: WER for different amounts of training data for single-pass retraining and MLLR adaptation

Fig. 3: Average WER in particular phases of noise reduction for clean and multi-condition training

Table 1: Front-end setup

Table 2: Description of SPEECON subsets and average estimated SNR

Table 3: Average amount of speech data for limited adaptation subsets

Table 4: WER for different environmental conditions without and with ESS in front-end processing. The models are trained on clean data.

Table 5: WER for different environmental conditions without and with ESS in front-end processing. The models are trained on multi-condition data.

Table 6: WER for clean (Clean) and multi-condition (M-C) training data, and relative improvement for multi-condition training against clean training (no retraining/adaptation)

Table 8: WER for clean (Clean) and multi-condition (M-C) trained initial models with MLLR adaptation