Publications

<Seong-Gyun Leem (임성균)>

Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions

Authors: Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso

Journal (conference): Proceedings of Interspeech 2021

Year of publication: 2021

<Abstract>

When speech emotion recognition (SER) is applied in an actual application, the system should be able to cope with audio acquired in a noisy, unconstrained environment. Most studies on noise-robust SER require a parallel dataset with emotion labels, which is impractical to collect, or use speech with artificially added noise, which does not resemble practical conditions. This study builds upon the ladder network formulation, which can effectively compensate the environmental differences between a clean speech corpus and real-life recordings. This study proposes a decoupled ladder network, which increases the robustness of the SER system against the influences of non-stationary background noise by decoupling the last hidden layer embedding into emotion and reconstruction embeddings. This novel implementation allows the emotion embedding to focus exclusively on building a discriminative representation, without worrying about the reconstruction task. We introduce a noisy version of the MSP-Podcast database, which contains audio segments collected with a smartphone that simultaneously records sentences from the corpus and non-stationary noise at different signal-to-noise ratios (SNRs). We test the effectiveness of our proposed model with this corpus, showing that the decoupled ladder network can increase the performance of the regular ladder network when dealing with noisy recordings.

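As deep neural networks have advanced, the performance of speech-based emotion recognition systems has improved considerably, but it still drops sharply when these systems are used in noisy real-world environments. This paper proposes a decoupled ladder network model based on semi-supervised learning, which uses both a large amount of clean speech and noisy speech collected in real environments during training, in order to improve emotion recognition performance in noisy conditions. We also collected a noisy emotional speech dataset that reflects real recording conditions by directly recording natural emotional speech gathered from podcasts together with radio noise containing a mixture of different sounds.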
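
The decoupling idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single dense encoder layer, a plain slice of the last hidden layer into two parts, and linear heads, and it leaves out the noisy encoder path and layer-wise denoising of a full ladder network. All names here (`decoupled_forward`, `emo_dim`, `emotion_head`, `decoder`) and all layer sizes are hypothetical and chosen only for illustration.

```python
import random


def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def decoupled_forward(features, params, emo_dim):
    """Split the last hidden embedding into emotion and reconstruction parts.

    Assumption: the first `emo_dim` units form the emotion embedding and the
    remaining units form the reconstruction embedding.
    """
    hidden = relu(linear(features, *params["encoder"]))
    emo_emb = hidden[:emo_dim]   # used only by the emotion prediction head
    rec_emb = hidden[emo_dim:]   # used only by the reconstruction decoder
    prediction = linear(emo_emb, *params["emotion_head"])
    reconstruction = linear(rec_emb, *params["decoder"])
    return prediction, reconstruction, emo_emb, rec_emb


# Hypothetical sizes: 8 input features, 6 hidden units, 2 of them for emotion.
rng = random.Random(0)
n_features, n_hidden, emo_dim, n_outputs = 8, 6, 2, 3
params = {
    "encoder": init_layer(n_features, n_hidden, rng),
    "emotion_head": init_layer(emo_dim, n_outputs, rng),
    "decoder": init_layer(n_hidden - emo_dim, n_features, rng),
}
features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
prediction, reconstruction, emo_emb, rec_emb = decoupled_forward(
    features, params, emo_dim
)
print(len(emo_emb), len(rec_emb), len(prediction), len(reconstruction))
```

In this sketch the reconstruction output depends only on `rec_emb`, so a reconstruction loss computed on unlabeled noisy recordings would not directly constrain the units in `emo_emb`, which is the intuition the abstract gives for the decoupling.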