Spanish Emotional Text-To-Speech Synthesis

By: Alex Mares

Summary of best results and state-of-the-art (SOTA) comparison. Our emotional TTS enables the human-computer interaction (HCI) system to respond empathetically.

XTTS was fine-tuned with the xtts-finetune-webui tool on the female Spanish speaker subset of the INTERFACE dataset (INTER1SP corpus). Fine-tuning targeted the acoustic decoder while keeping the multilingual backbone frozen, enabling efficient speaker adaptation with limited data.

Read the full paper on MDPI.
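
The idea of training only the decoder while freezing the backbone can be illustrated with a minimal Python sketch. It is not the project's actual training code: the `ToyModel` and `Param` classes are stand-ins for a framework model, and the parameter names (`backbone.*`, `decoder.*`) are hypothetical. The only assumption about the real model is that it exposes `named_parameters()` with a `requires_grad` flag per parameter, as PyTorch modules do.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Param:
    """Stand-in for a framework tensor; only the requires_grad flag matters here."""
    requires_grad: bool = True


class ToyModel:
    """Hypothetical model exposing named_parameters() the way PyTorch modules do."""

    def __init__(self, names: Iterable[str]) -> None:
        self._params = {name: Param() for name in names}

    def named_parameters(self) -> Iterator[Tuple[str, Param]]:
        return iter(self._params.items())


def freeze_except(model, trainable_prefixes: Tuple[str, ...]) -> List[str]:
    """Freeze every parameter whose name does not start with a trainable prefix.

    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        # str.startswith accepts a tuple, so several prefixes can be kept trainable.
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Hypothetical parameter names; a real XTTS checkpoint uses its own naming.
model = ToyModel([
    "backbone.layer0.weight",
    "backbone.layer1.weight",
    "decoder.conv.weight",
    "decoder.conv.bias",
])
trainable_names = freeze_except(model, trainable_prefixes=("decoder.",))
print(trainable_names)
```

Selecting parameters by name prefix keeps the choice of what to adapt in one place, so only the trainable subset needs to be handed to the optimizer.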