Spanish Speech Emotion Recognition (SER)

Explore the latest SER results in Spanish.


By: Alex Mares

▶ Try the HCI Demo on Gradio

Part 1. PTMs as feature extractors for Spanish SER


Summary of best results and SOTA comparison
Our results set the state of the art across six different databases.

This study presents the first comparative evaluation of PTMs for Spanish SER across six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—and six emotional speech datasets. Using a layer-wise feature extraction framework with Leave-One-Speaker-Out validation, our method outperforms prior benchmarks, reaching F1-scores of 88.32% (EmoMatchSpanishDB), 99.83% (INTER1SP), and 92.53% (MEACorpus).
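The sketch below illustrates the two ingredients named above: mean-pooled hidden states taken from a single layer of a pre-trained model, and Leave-One-Speaker-Out evaluation. The checkpoint name, layer index, pooling strategy, and MLP classifier are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of layer-wise PTM feature extraction with
# Leave-One-Speaker-Out (LOSO) validation. Model, layer, and classifier
# choices here are assumptions for illustration only.
import torch
import numpy as np
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

MODEL_ID = "facebook/wav2vec2-xls-r-300m"  # any PTM exposing hidden states
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def layer_features(waveform, sr=16000, layer=9):
    """Mean-pool the hidden states of one transformer layer into a single vector."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def loso_weighted_f1(features, labels, speakers):
    """LOSO: each speaker is held out exactly once, so no speaker leaks into training."""
    preds, trues = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups=speakers):
        clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
        clf.fit(features[train_idx], labels[train_idx])
        preds.extend(clf.predict(features[test_idx]))
        trues.extend(labels[test_idx])
    return f1_score(trues, preds, average="weighted")
```

Repeating the extraction for every layer and keeping the best-scoring one is what makes the evaluation layer-wise; the speaker-grouped split is what keeps the F1 scores speaker-independent.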

Read the full paper on MDPI

Part 2. Multitask Learning SER System (Thesis)


This work introduces the first MTL approach for Spanish SER, trained on six diverse corpora. Using a frozen Wav2Vec2 XLSR encoder and an MLP classifier, the proposed system surpasses the single-task baseline by 2.37 WF1 points, reaching 90.56% in emotion classification, while also achieving near-perfect scores in speaker profiling (99.39%) and regional accent detection (99.91%).
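As a rough illustration of this architecture, the PyTorch sketch below wires a frozen Wav2Vec2 XLSR encoder to a shared MLP and three task-specific heads (emotion, speaker profiling, regional accent). Class counts, hidden sizes, and the checkpoint name are assumptions for the example, not the thesis's exact hyperparameters.

```python
# Sketch of the multitask setup described above: a frozen Wav2Vec2 XLSR
# encoder feeding shared pooled features into one linear head per task.
# Dimensions and class counts are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultitaskSER(nn.Module):
    def __init__(self, n_emotions=6, n_profiles=4, n_accents=5):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
        for p in self.encoder.parameters():  # keep the encoder frozen
            p.requires_grad = False
        dim = self.encoder.config.hidden_size
        self.shared = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Dropout(0.3))
        self.emotion_head = nn.Linear(256, n_emotions)
        self.profile_head = nn.Linear(256, n_profiles)
        self.accent_head = nn.Linear(256, n_accents)

    def forward(self, input_values):
        with torch.no_grad():  # no gradients through the frozen encoder
            hidden = self.encoder(input_values).last_hidden_state
        pooled = self.shared(hidden.mean(dim=1))  # mean-pool over time
        return (self.emotion_head(pooled),
                self.profile_head(pooled),
                self.accent_head(pooled))

# Training would typically sum one cross-entropy loss per task, e.g.:
# loss = ce(emo_logits, emo_y) + ce(prof_logits, prof_y) + ce(acc_logits, acc_y)
```

Because only the shared MLP and the heads are trained, the three tasks regularize each other through the shared layer while the acoustic representation itself stays fixed.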

Summary of results in Thesis
The multitask model without any data augmentation or balancing is the best-performing approach.
▶ Try the HCI Demo on Gradio
View the Code on GitHub

Dataset Overview Dashboard