Recent progress in speech processing has been driven largely by deep learning and, more recently, by self-supervised and pre-trained models that learn rich representations from large amounts of audio data. These advances have significantly improved the performance and robustness of systems across a wide range of tasks, from automatic speech recognition (speech-to-text) to speaker recognition and speaker diarization (determining who spoke when in multi-talker scenarios). Modern approaches rely on neural architectures and transferable representations (embeddings) that can be adapted to different scenarios, languages, and acoustic conditions.

This course provides a structured and practical introduction to these developments, guiding participants from the fundamentals of speech representations to the design of complete speech processing systems. It progressively covers core tasks such as speech-to-text, speaker recognition, and speaker diarization, highlighting the role of embedding-based methods, end-to-end modeling, and evaluation methodologies. Emphasis is placed on understanding how different components interact within real-world pipelines, as well as on the use of modern tools and pre-trained models to build task-specific systems. Throughout the course, participants will gain experience implementing and analyzing speech processing solutions, enabling them to apply these techniques in research or industrial settings.

The course is part of the NLP master's program hosted by the Ixa NLP research group at the HiTZ research center of the University of the Basque Country (UPV/EHU).

Student profile

Aimed at professionals, researchers, and students with programming experience in Python. Knowledge of mathematics and signal processing (at the level of a BSc in Sciences or Engineering) is also recommended. Although not strictly necessary, a Colab Pro subscription is recommended for greater GPU availability.

Contents

Foundations of Speech Processing and Deep Learning

Introduction to Speech Signals and Representations
Neural Networks for Speech Processing
Deep Learning Training Paradigms
LABORATORY: Speech Representations

Deep Learning for Speech Recognition: Speech-to-Text

Introduction to Automatic Speech Recognition
Modelling Approaches (CTC, Transformers)
Modern ASR Systems (End-to-end, pre-trained models)
Evaluation, Common Benchmarks and Tools
LABORATORY: Speech-to-Text

Deep Learning for Speaker Recognition: Embedding-based Representations

Introduction to Speaker Recognition
Speaker Embeddings
Robust Learning and Representations
Evaluation, Common Benchmarks and Tools
LABORATORY: Speaker Embeddings

Deep Learning for Speaker Diarization: Multi-talker Scenarios

Introduction to Speaker Diarization
Clustering-based Approaches
End-to-End Diarization
Evaluation, Common Benchmarks and Tools
Joint ASR and Diarization
LABORATORY: Speaker Diarization

Instructors

Alicia Lozano-Diez

Associate Professor at UAM
Universidad Autónoma de Madrid

Practical details

General information

  • Part of the Language Analysis and Processing master program.
  • 4 sessions totalling 10 hours of instruction.
  • Scheduled from July 13th to July 16th, 2026, 15:00 - 17:30 CET.
  • Teaching language: English.
  • Cost: 250€ (1 ECTS).
  • Although not strictly necessary, a Colab Pro subscription is recommended for greater GPU availability during the practical labs.
  • Course code: DL4SP.