Automatic Speech Recognition (ASR) is the task of converting speech audio to text. Modern systems employ neural networks trained with CTC loss, encoder-decoder and transducer architectures, and self-supervised pretraining, reaching accuracy competitive with human transcription in many domains.
CTC Loss (Connectionist Temporal Classification):
- Alignment problem: the input has many more acoustic frames (typically one every 10-40 ms) than output tokens, and the frame-to-token alignment is unknown; CTC marginalizes over all valid alignments automatically
- Blank token: CTC introduces a special blank token emitted for non-speech frames and between repeated tokens; this enables flexible alignments
- Forward-backward algorithm: efficiently computes probability of output sequence over all alignments
- Training: minimize CTC loss (summed over all valid alignments); no manual frame-level alignment needed
- Decoding: greedy selection or beam search; CTC post-processing collapses consecutive duplicates, then removes blanks (see the sketch after this list)
- Advantages: enables end-to-end training; reduces pipeline complexity vs. traditional HMM-GMM systems
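A minimal PyTorch sketch of CTC training and greedy decoding, assuming log-probabilities produced by some acoustic encoder (the shapes and vocabulary size here are illustrative):

```python
import torch
import torch.nn as nn

# Blank is conventionally index 0 for torch.nn.CTCLoss.
BLANK = 0
vocab_size = 32          # blank + 31 output tokens (illustrative)
T, N, U = 100, 4, 12     # input frames, batch size, target length

log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1)  # (T, N, V) from an encoder
targets = torch.randint(1, vocab_size, (N, U))                 # labels exclude the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

# The forward-backward algorithm inside CTCLoss sums over all valid alignments.
ctc = nn.CTCLoss(blank=BLANK)
loss = ctc(log_probs, targets, input_lengths, target_lengths)

def greedy_ctc_decode(log_probs_1d):
    """Greedy decode one utterance: argmax per frame, collapse repeats, drop blanks."""
    best = log_probs_1d.argmax(dim=-1).tolist()   # frame-wise best tokens
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

hypothesis = greedy_ctc_decode(log_probs[:, 0, :])
```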
Encoder-Decoder and Transducer Architectures (LAS, RNN-T/Transformer-Transducer):
- Encoder: BiLSTM, Transformer, or Conformer processes the audio and outputs a sequence of hidden states (not a single context vector)
- Attention decoder (LAS-style): an autoregressive decoder attends over encoder outputs with soft attention, learning to focus on the audio frames relevant to each output token
- Transducer (RNN-T): replaces cross-attention with a label-only prediction network plus a joint network that combines encoder and prediction outputs to score the next token, as sketched below
- Streaming capability: transducers stream naturally when the encoder is causal; attention decoders need modifications such as chunk-based processing or monotonic attention
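A minimal sketch of a transducer joint network in PyTorch; the class name and dimensions are illustrative, and a real system pairs this with an encoder, a prediction network, and a transducer loss (e.g., torchaudio.functional.rnnt_loss):

```python
import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    """Sketch of an RNN-T joint network: combines one encoder frame (time t)
    with one prediction-network state (label position u) to score the next token."""
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab_size=1024):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)  # vocab includes the blank

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim), pred: (B, U, pred_dim)
        # Broadcast over the (T, U) lattice that the transducer loss marginalizes.
        joint = torch.tanh(self.enc_proj(enc).unsqueeze(2) +
                           self.pred_proj(pred).unsqueeze(1))
        return self.out(joint)  # (B, T, U, vocab_size)

joint = TransducerJoint()
logits = joint(torch.randn(2, 100, 512), torch.randn(2, 12, 512))
```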
Wav2Vec 2.0 Self-Supervised Pretraining:
- Masked prediction: mask spans of the latent speech representations; the Transformer must infer the content of masked positions from surrounding context
- Contrastive learning: for each masked position, distinguish the true quantized latent from negatives sampled from other masked time steps
- Learned quantization: continuous features quantized to discrete codebook; enables contrastive setup
- Foundation model: pretrain on large unlabeled corpora (up to ~60k hours of Libri-Light audio); transfer to downstream ASR by fine-tuning with a CTC head
- Label efficiency: pretrained wav2vec 2.0 reaches competitive WER when fine-tuned on as little as 10 minutes to 10 hours of labeled speech
- Multilingual wav2vec: XLSR-53 pretrains on 53 languages (XLS-R scales to 128); shared representations enable cross-lingual transfer to low-resource languages (a usage sketch for a fine-tuned English checkpoint follows this list)
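A short inference sketch using the Hugging Face transformers implementation and the public facebook/wav2vec2-base-960h checkpoint (assumes 16 kHz mono audio and a local file named utterance.wav):

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("utterance.wav")          # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze(0).numpy(),
                   sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits           # (1, frames, vocab)

pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])               # greedy CTC transcript
```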
Conformer Architecture:
- Hybrid design: interleaves convolutional blocks (local feature extraction) with transformer blocks (long-range context)
- Convolution module: pointwise convolutions with GLU gating around a depthwise convolution capture local patterns; relative positional encoding in the attention provides order information
- Transformer blocks: multi-head self-attention captures long-range dependencies; parallel processing
- Macaron-style FFN: two half-step position-wise feed-forward networks sandwich the attention and convolution modules, a structure found to ease optimization
- Performance: Conformer set the state of the art on LibriSpeech at publication (≈1.9%/3.9% WER on test-clean/test-other with LM fusion), outperforming pure CNN, RNN, and Transformer baselines; a simplified block is sketched below
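A simplified single Conformer block, assuming plain multi-head attention in place of the paper's relative positional attention; dimensions and the kernel size are illustrative:

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Simplified Conformer block: FFN/2 -> self-attention -> conv module -> FFN/2,
    each wrapped in a residual connection."""
    def __init__(self, d=256, heads=4, conv_kernel=31, ffn_mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, ffn_mult * d),
                                 nn.SiLU(), nn.Linear(ffn_mult * d, d))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Sequential(
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),             # pointwise + GLU
            nn.Conv1d(d, d, conv_kernel, padding=conv_kernel // 2,
                      groups=d),                               # depthwise conv
            nn.BatchNorm1d(d), nn.SiLU(), nn.Conv1d(d, d, 1))  # pointwise
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                     # x: (B, T, d)
        x = x + 0.5 * self.ffn1(x)            # macaron half-step FFN
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2) # (B, d, T) for Conv1d
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)

block = ConformerBlock()
y = block(torch.randn(2, 100, 256))           # (2, 100, 256)
```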
Language Model Integration:
- Shallow fusion: add weighted language-model log-probabilities to acoustic-model scores during beam search; a simple post-hoc method (see the scoring sketch after this list)
- Deep fusion: fuse the hidden states of a pretrained LM into the decoder through a learned gate; tighter integration than post-hoc rescoring
- Combined approaches: shallow and deep fusion are complementary and can be applied together for further improvements
- External ARPA n-gram LMs: traditional language models integrated with neural acoustic models
- Neural language models: LSTM or transformer LMs trained on text corpus; capture language structure
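A sketch of shallow-fusion scoring for one candidate token during beam search; the weight values here are hypothetical and would be tuned on development data:

```python
import math

def shallow_fusion_score(am_logprob, lm_logprob, lm_weight=0.5, length_bonus=1.0):
    """Score one candidate token extension during beam search.
    am_logprob / lm_logprob: log P of the token under the acoustic model
    and the external LM; the per-token bonus counters deletion bias."""
    return am_logprob + lm_weight * lm_logprob + length_bonus

# Example: the acoustic model slightly prefers "there", but the LM's
# context strongly prefers "their", flipping the combined ranking.
print(shallow_fusion_score(math.log(0.6), math.log(0.2)))  # "there"
print(shallow_fusion_score(math.log(0.4), math.log(0.7)))  # "their" wins
```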
Beam Search Decoding:
- Heuristic search: maintain the K best hypotheses (beam width); expand each hypothesis by one token per step (a toy implementation follows this list)
- Pruning: remove low-probability hypotheses; maintain tractable beam width (typically 8-128)
- Language model rescoring: rerank beam hypotheses using language model probabilities
- Length normalization: penalize overly long/short hypotheses; encourage appropriate sequence lengths
- Inference speed: larger beam width improves accuracy but increases latency; accuracy-latency tradeoff
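A toy beam search over a fixed matrix of per-step log-probabilities; a real decoder recomputes the token distribution per hypothesis, so this only illustrates expansion, scoring, pruning, and length normalization:

```python
import torch

def beam_search(step_logprobs, beam_width=8):
    """step_logprobs: (T, V) per-step token log-probs (toy stand-in for a decoder)."""
    beams = [([], 0.0)]                      # (token list, cumulative log-prob)
    for t in range(step_logprobs.size(0)):
        candidates = []
        for tokens, score in beams:
            topv, topi = step_logprobs[t].topk(beam_width)
            for lp, tok in zip(topv.tolist(), topi.tolist()):
                candidates.append((tokens + [tok], score + lp))
        # Prune: keep only the beam_width best hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Length normalization when comparing final hypotheses.
    return max(beams, key=lambda c: c[1] / max(len(c[0]), 1))

best_tokens, best_score = beam_search(torch.randn(20, 100).log_softmax(-1))
```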
Word Error Rate (WER) Evaluation:
- WER metric: WER = 100 × (S + D + I) / N, where S = substitutions, D = deletions, I = insertions, N = reference word count (can exceed 100%; implemented below)
- Benchmark datasets: LibriSpeech (~1000 hours of read English audiobooks with clean and "other" splits), Common Voice (crowdsourced, multilingual), VoxPopuli (European Parliament recordings)
- State-of-the-art: Conformer-style encoders combined with wav2vec 2.0 pretraining and LM fusion reach roughly 1.5-2% WER on LibriSpeech test-clean (about 3-4% on test-other)
- Robustness: the test-other subset contains harder, lower-quality recordings; real deployments add background noise, overlapping speakers, and reverberation
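The WER formula above reduces to a word-level Levenshtein distance, as in this self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = 100 * (S + D + I) / N via Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# 1 substitution (sat -> sit) + 1 deletion (the) over 6 reference words = 33.3%
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```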
Real-World ASR Challenges:
- Acoustic variation: speaker differences, background noise, reverberation, accents; robust acoustic modeling
- Domain mismatch: training data distribution different from deployment; domain adaptation techniques
- Streaming constraints: online ASR requires low latency, which rules out full-utterance lookahead; causal or chunked encoders trade some accuracy for responsiveness
- Computational constraints: edge deployment requires model compression via quantization, pruning, and distillation (a quantization sketch follows this list)
- Multilingual/code-switching: handling multiple languages within single utterance; shared representations
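A minimal example of post-training dynamic quantization with PyTorch; the tiny stand-in model is hypothetical, but the same call applies to the Linear layers of a real ASR model:

```python
import torch

# Stand-in for an acoustic model's linear layers (80 features in, 32 tokens out).
model = torch.nn.Sequential(
    torch.nn.Linear(80, 512), torch.nn.ReLU(), torch.nn.Linear(512, 32))

# Convert Linear weights to int8; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 80)
print(quantized(x).shape)   # same interface, smaller weights, faster int8 matmuls
```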
ASR System Components:
- Feature extraction: Mel-frequency cepstral coefficients (MFCC) or log-Mel spectrogram; acoustic features
- Normalization: mean-variance normalization per utterance; stabilizes training
- Augmentation: SpecAugment masks random frequency and time bands in the spectrogram; improves robustness without additional data (see the feature-pipeline sketch after this list)
- Contextualization: biased language models for domain-specific terms; personalization and named entities
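A feature-pipeline sketch with torchaudio: log-Mel extraction, per-utterance mean-variance normalization, and SpecAugment-style masking (the file name and masking parameters are illustrative):

```python
import torch
import torchaudio

waveform, sr = torchaudio.load("utterance.wav")       # (channels, samples)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=80)
feats = torch.log(mel(waveform) + 1e-6)               # (channels, 80, frames)

# Per-utterance mean-variance normalization over time.
feats = (feats - feats.mean(dim=-1, keepdim=True)) / \
        (feats.std(dim=-1, keepdim=True) + 1e-6)

# SpecAugment: mask random frequency and time bands (training only).
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)
augmented = time_mask(freq_mask(feats))
```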
In summary, automatic speech recognition converts audio to text using neural networks with CTC alignment, attention, or transducer decoding, leveraging self-supervised pretraining (wav2vec 2.0) and language-model fusion to approach human-level transcription accuracy.