Spanish Speech Emotion Recognition (SER)

Explore the latest SER results in Spanish.


By: Alex Mares

▶ Try the HCI Demo on Gradio

Part 1. PTMs as feature extractors for Spanish SER


Summary of best results and SOTA comparison
Our results set the state of the art across six different databases.

This study presents the first comparative evaluation of PTMs for Spanish SER across six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—and six emotional speech datasets. Using a layer-wise feature extraction framework with Leave-One-Speaker-Out validation, our method outperforms prior benchmarks, reaching F1-scores of 88.32% (EmoMatchSpanishDB), 99.83% (INTER1SP), and 92.53% (MEACorpus).
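The sketch below illustrates the two ingredients named above: mean-pooled hidden states taken from a single layer of a pre-trained model, and Leave-One-Speaker-Out evaluation. The checkpoint name, layer index, pooling strategy, and MLP classifier are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of layer-wise PTM feature extraction with
# Leave-One-Speaker-Out (LOSO) validation. Model, layer, and classifier
# choices here are assumptions for illustration only.
import torch
import numpy as np
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

MODEL_ID = "facebook/wav2vec2-xls-r-300m"  # any PTM exposing hidden states
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def layer_features(waveform, sr=16000, layer=9):
    """Mean-pool the hidden states of one transformer layer into a single vector."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def loso_weighted_f1(features, labels, speakers):
    """LOSO: each speaker is held out exactly once, so no speaker leaks into training."""
    preds, trues = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups=speakers):
        clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
        clf.fit(features[train_idx], labels[train_idx])
        preds.extend(clf.predict(features[test_idx]))
        trues.extend(labels[test_idx])
    return f1_score(trues, preds, average="weighted")
```

Repeating the extraction for every layer and keeping the best-scoring one is what makes the evaluation layer-wise; the speaker-grouped split is what keeps the F1 scores speaker-independent.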

Read the full paper on MDPI

Part 2. Multitask Learning SER System (Thesis)


This work introduces the first MTL approach for Spanish SER, trained on six diverse corpora. Using a frozen Wav2Vec2 XLSR encoder and an MLP classifier, the proposed system surpasses the single-task baseline by 2.37 WF1 points, reaching 90.56% in emotion classification, while also achieving near-perfect scores in speaker profiling (99.39%) and regional accent detection (99.91%).
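As a rough illustration of this architecture, the PyTorch sketch below wires a frozen Wav2Vec2 XLSR encoder to a shared MLP and three task-specific heads (emotion, speaker profiling, regional accent). Class counts, hidden sizes, and the checkpoint name are assumptions for the example, not the thesis's exact hyperparameters.

```python
# Sketch of the multitask setup described above: a frozen Wav2Vec2 XLSR
# encoder feeding shared pooled features into one linear head per task.
# Dimensions and class counts are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultitaskSER(nn.Module):
    def __init__(self, n_emotions=6, n_profiles=4, n_accents=5):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
        for p in self.encoder.parameters():  # keep the encoder frozen
            p.requires_grad = False
        dim = self.encoder.config.hidden_size
        self.shared = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Dropout(0.3))
        self.emotion_head = nn.Linear(256, n_emotions)
        self.profile_head = nn.Linear(256, n_profiles)
        self.accent_head = nn.Linear(256, n_accents)

    def forward(self, input_values):
        with torch.no_grad():  # no gradients through the frozen encoder
            hidden = self.encoder(input_values).last_hidden_state
        pooled = self.shared(hidden.mean(dim=1))  # mean-pool over time
        return (self.emotion_head(pooled),
                self.profile_head(pooled),
                self.accent_head(pooled))

# Training would typically sum one cross-entropy loss per task, e.g.:
# loss = ce(emo_logits, emo_y) + ce(prof_logits, prof_y) + ce(acc_logits, acc_y)
```

Because only the shared MLP and the heads are trained, the three tasks regularize each other through the shared layer while the acoustic representation itself stays fixed.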

Summary of results in Thesis
The multitask model without any data augmentation or balancing is the best-performing approach.
▶ Try the HCI Demo on Gradio
View the Code on GitHub

Dataset Overview Dashboard