Core idea

Wav2vec learns powerful speech representations through self-supervised pre-training on unlabeled audio. Like BERT for audio, it learns general-purpose representations from large amounts of unlabeled speech and is then fine-tuned for downstream tasks with a small amount of labeled data.

Wav2vec 2.0 architecture: A CNN feature encoder turns the raw waveform into latent frames, a transformer context network builds contextualized representations over those frames, and a contrastive loss trains the model to pick the correct quantized latent for each masked position out of a set of distractors.

Training: Spans of the latent frames are masked. For each masked position, the model must identify the true latent among negatives sampled from other masked positions in the same utterance.

Pre-training data: Thousands of hours of unlabeled speech.

Downstream tasks: Automatic speech recognition (ASR), speaker identification, emotion recognition, language identification.

Results: After pre-training, fine-tuning on roughly 1% of the labeled data can approach the accuracy of fully supervised systems. This makes ASR feasible for low-resource languages.

XLS-R / XLSR: Multilingual variants of wav2vec 2.0. XLSR-53 was pre-trained on 53 languages, XLS-R on 128.

HuBERT: An alternative self-supervised approach that predicts clustering-based targets (k-means cluster assignments) for masked frames instead of using a contrastive objective.

Fine-tuning: Add a linear layer or CTC head on top of the pre-trained model and fine-tune on the labeled task.

Impact: Democratized speech AI by making high-quality ASR possible without massive labeled corpora. Foundation for many speech models and APIs.
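The contrastive objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: it scores one context vector against the true latent and a handful of distractors using cosine similarity and a temperature, then returns the negative log-probability of the true latent. The function names, the toy 2-dimensional vectors, and the temperature default are assumptions made for the example. Quantization and the additional loss terms used in wav2vec 2.0 are left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of choosing `target` among all candidates.

    `context` stands in for the transformer output at a masked position,
    `target` for the true latent, `distractors` for the sampled negatives.
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]


# Toy example: the context vector points almost exactly at the true latent.
context = [1.0, 0.0]
true_latent = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# Low loss: the true latent is the candidate most similar to the context.
loss_easy = contrastive_loss(context, true_latent, negatives)
# Higher loss: a dissimilar vector is treated as the "true" latent instead.
loss_hard = contrastive_loss(context, negatives[0], [true_latent, negatives[1]])
```

The loss falls as the context vector moves closer to the true latent and further from the distractors, which is the pressure that shapes the learned representations during pre-training.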
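For the fine-tuning step, a CTC head emits one score per vocabulary symbol for every audio frame, and the simplest way to turn those scores into text is greedy decoding: take the best symbol per frame, merge consecutive repeats, and drop the blank symbol. The sketch below shows only that decoding rule. The vocabulary, the blank index, and the `one_hot` helper are hypothetical, and real systems often use beam search with a language model instead.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for label in best_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank_id:
            decoded.append(vocab[label])
        previous = label
    return "".join(decoded)


def one_hot(index, size):
    """Hypothetical helper: fake per-frame scores with one clear winner."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab = ["_", "c", "a", "t"]  # index 0 is the CTC blank
frames = [one_hot(i, len(vocab)) for i in [1, 1, 0, 2, 2, 0, 3]]
text = ctc_greedy_decode(frames, vocab)  # "cat"
```

The blank symbol is what lets the model output a genuinely doubled letter: the frame sequence a, blank, a decodes to "aa", while a, a decodes to a single "a".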
