Speech Recognition (ASR) Transformers are neural architectures that convert spoken audio into text by processing mel-spectrogram features through encoder-decoder or encoder-only Transformer networks, achieving near-human transcription accuracy across many languages through large-scale pre-training: self-supervised on hundreds of thousands of hours of unlabeled audio, or weakly supervised on web-scale transcribed audio.
Architecture Evolution:
- CTC-Based (Connectionist Temporal Classification): encoder-only model outputs character or subword probabilities for each audio frame; the CTC loss aligns variable-length audio with variable-length text without explicit alignment (see the CTC loss sketch after this list); simple and fast, but its conditional-independence assumption means it captures no language-model context between output tokens
- Attention-Based Encoder-Decoder: audio encoder produces acoustic representations; text decoder attends to encoder outputs and generates tokens autoregressively; captures language model context but attention can lose monotonic alignment for long utterances
- CTC+Attention Hybrid: combine CTC and attention objectives during training; use CTC for alignment regularization and attention for flexible generation; ESPnet recipes and many hybrid Conformer systems demonstrate the benefits (Whisper, by contrast, is a pure attention-based encoder-decoder)
- Conformer: replaces standard Transformer encoder with Conformer blocks combining convolution (local audio patterns) and self-attention (global context); convolution captures local spectral features that pure attention may miss; dominant architecture in production ASR systems
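To make the CTC item above concrete, here is a minimal sketch of alignment-free training using PyTorch's torch.nn.CTCLoss; the tensor sizes, vocabulary, and random stand-in encoder outputs are illustrative assumptions rather than a real system.

```python
import torch
import torch.nn as nn

# Illustrative sizes; not taken from any particular system
vocab_size = 32           # characters/subwords plus one blank symbol at index 0
num_frames = 200          # encoder output frames per utterance
batch_size = 4

# Stand-in for acoustic encoder outputs: (time, batch, vocab) log-probabilities
logits = torch.randn(num_frames, batch_size, vocab_size, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

# Variable-length integer targets, concatenated across the batch (blank=0 is reserved)
target_lengths = torch.tensor([12, 7, 15, 9])
targets = torch.randint(1, vocab_size, (int(target_lengths.sum()),))

# Every utterance uses all frames here; real batches would have per-utterance lengths
input_lengths = torch.full((batch_size,), num_frames, dtype=torch.long)

# CTC marginalizes over every monotonic alignment of the targets onto the frames,
# so no frame-to-token alignment is ever supplied
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```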
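A single Conformer block is likewise easy to sketch from standard PyTorch modules. The version below keeps the half-step feed-forward, self-attention, and convolution-module ordering of the Conformer design but substitutes plain multi-head attention for the original relative-positional attention; all sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Minimal Conformer block: FFN/2 -> self-attention -> convolution -> FFN/2."""

    def __init__(self, d_model=256, num_heads=4, conv_kernel=31, ff_mult=4, dropout=0.1):
        super().__init__()
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model), nn.Dropout(dropout),
        )
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        # Convolution module: pointwise+GLU -> depthwise -> norm -> pointwise
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1), nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise: local spectral patterns
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1), nn.Dropout(dropout),
        )
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model), nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)              # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # global context
        c = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)   # local patterns
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Example: 200 frames of features already projected to d_model elsewhere
features = torch.randn(1, 200, 256)
print(ConformerBlock()(features).shape)       # torch.Size([1, 200, 256])
```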
Whisper (OpenAI):
- Architecture: encoder-decoder Transformer; encoder processes 30-second mel spectrogram segments (80 mel bins × 3000 frames); decoder generates text tokens autoregressively with special tokens for language detection, timestamps, and task specification
- Training Data: 680,000 hours of weakly supervised audio-transcript pairs scraped from the web; multilingual training covers 99 languages; no manual curation, with quality controlled by automated heuristics such as discarding machine-generated transcripts and checking that the transcript language matches the detected spoken language
- Multitask Training: single model handles transcription, translation, language identification, and voice activity detection through task-specifying tokens in the decoder prompt (the decoding sketch after this list shows these tokens in use)
- Robustness: trained on diverse acoustic conditions (background noise, accents, recording quality); generalizes to unseen domains without fine-tuning; competitive with domain-specific systems across benchmarks
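As a usage sketch for the Whisper items above, the snippet below runs transcription through the Hugging Face transformers implementation and shows where the task- and language-specifying decoder tokens enter; the checkpoint choice and the silent placeholder waveform are assumptions, and the exact generate() arguments vary somewhat across transformers versions.

```python
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Placeholder waveform: 5 seconds of 16 kHz silence; replace with real audio
audio = np.zeros(16_000 * 5, dtype=np.float32)

# Log-mel features for one 30-second window (80 mel bins x 3000 frames, zero-padded)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# The task and language are selected by special decoder-prompt tokens,
# e.g. <|en|> <|transcribe|> <|notimestamps|>; this prints their (position, id) pairs
print(processor.get_decoder_prompt_ids(language="english", task="transcribe"))

with torch.no_grad():
    generated = model.generate(inputs.input_features, language="english", task="transcribe")

print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```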
Self-Supervised Pre-training:
- wav2vec 2.0 / HuBERT: pre-train encoder on unlabeled audio using contrastive or masked prediction objectives; learn speech representations from raw waveforms; fine-tune with CTC on small labeled datasets (10-100 hours), achieving results comparable to fully supervised models trained on orders of magnitude more labeled data (see the wav2vec 2.0 decoding sketch after this list)
- Representation Learning: encoder learns hierarchical speech features — lower layers capture acoustic/phonetic features, upper layers capture linguistic structure; pre-trained representations transfer across languages, accents, and recording conditions
- Low-Resource Languages: self-supervised pre-training enables ASR for languages with minimal labeled data; MMS (Meta) covers 1,100+ languages by pre-training on 500K hours of unlabeled audio and fine-tuning with as few as 1 hour of transcribed speech per language
- Data Efficiency: reduces labeled data requirements by 10-100×; pre-training on unlabeled audio (cheap and abundant) plus fine-tuning on labeled audio (expensive and scarce) is the standard paradigm
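The pre-train-plus-fine-tune paradigm shows up directly in published checkpoints: the sketch below loads a wav2vec 2.0 model whose self-supervised encoder was fine-tuned with a CTC head on labeled LibriSpeech and decodes greedily; it assumes the Hugging Face transformers API, and the silent placeholder waveform is an assumption.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# wav2vec 2.0 base: self-supervised pre-training, then CTC fine-tuning on LibriSpeech
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Placeholder: 3 seconds of 16 kHz silence; replace with a real waveform
audio = np.zeros(16_000 * 3, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits        # (batch, frames, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats and drop blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```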
Production Deployment:
- Streaming vs Offline: offline models process complete utterances (higher accuracy); streaming models process audio in real-time chunks (lower latency, needed for voice assistants and live captioning); chunked attention and causal convolutions enable streaming Conformer architectures
- Inference Optimization: INT8 quantization reduces model size and typically speeds inference 2-3× with under 0.5% absolute WER degradation (see the quantization sketch after this list); beam search with width 5-10 for quality versus greedy decoding for speed; speculative decoding also transfers to ASR for faster generation
- Word Error Rate (WER): standard metric; WER = (substitutions + deletions + insertions) / reference word count, i.e. word-level edit distance between predicted and reference transcriptions normalized by reference length; human WER on conversational speech is roughly 5%; the best models reach roughly 2% or lower on clean read speech (LibriSpeech test-clean)
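Because WER is simply a normalized word-level edit distance, it can be stated exactly in a few lines; the sketch below is a from-scratch dynamic-programming implementation (libraries such as jiwer provide the same metric).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                     # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                     # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,                  # substitution (or match)
                             dist[i - 1][j] + 1,   # deletion
                             dist[i][j - 1] + 1)   # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

# One substitution ("quick" -> "quack") in a five-word reference -> WER 0.2
print(word_error_rate("the quick brown fox jumps", "the quack brown fox jumps"))
```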
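Returning to the INT8 quantization point in the optimization item above, PyTorch's post-training dynamic quantization is the simplest illustration: linear-layer weights become INT8 and activations are quantized on the fly at inference time. This is only one of several quantization routes (static and quantization-aware training are common in production), and the placeholder model here is an assumption:

```python
import torch

# Placeholder encoder-like model; in practice this would be a trained ASR model
model = torch.nn.Sequential(
    torch.nn.Linear(80, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 32),
).eval()

# Post-training dynamic quantization: INT8 weights for Linear layers,
# activations quantized dynamically at inference time (CPU)
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

features = torch.randn(1, 200, 80)            # (batch, frames, mel bins)
with torch.no_grad():
    out = quantized(features)
print(out.shape)                              # torch.Size([1, 200, 32])
```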
Speech recognition transformers have brought transcription accuracy to or near human parity for major languages; Whisper's multilingual robustness and wav2vec 2.0's data efficiency are the breakthroughs that make accurate speech recognition attainable across a vast range of languages and acoustic conditions.