Speech and Audio Processing Chip: Always-On Keyword Spotting Engine — ultra-low-power neural network for wake-word detection enabling voice assistant activation with <1 mW standby power budget
Always-On Keyword Spotting Architecture
- Ultra-Low Power: <1 mW standby power (multi-month runtime on a single AAA cell), achieved via a specialized DSP + NPU audio pipeline
- Neural Network Model: DS-CNN (depthwise separable CNN) or LSTM for keyword detection, ~50 kB model size for sub-1 mW
- Trigger Latency: <100 ms detection latency (user-acceptable wake-word response), balanced against false-positive rejection
- False Positive Rate: <10 false accepts per 24 h is a common user-experience target, tuned via model architecture and negative training data
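As a sanity check on the power budget above, a quick back-of-envelope runtime calculation (the AAA capacity and voltage figures are illustrative assumptions, not datasheet values):

```python
def runtime_days(capacity_mah, voltage_v, avg_power_mw):
    """Estimated runtime in days for a given average power draw."""
    energy_mwh = capacity_mah * voltage_v      # stored energy in mWh
    return energy_mwh / avg_power_mw / 24.0    # hours -> days

# AAA alkaline: ~1200 mAh at 1.5 V nominal (illustrative figures)
print(f"{runtime_days(1200, 1.5, 1.0):.0f} days at 1.0 mW")  # ~75 days
print(f"{runtime_days(1200, 1.5, 0.2):.0f} days at 0.2 mW")  # ~1 year
```

The arithmetic shows why average (duty-cycled) power, not peak inference power, sets battery life.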
Audio Front-End (AFE)
- Microphone Interface: PDM (pulse-density modulation) or analog microphone input, ~8-16 kHz sampling rate for speech (reduces power vs 48 kHz)
- Digital Conversion: PDM-to-PCM converter (CIC decimation filter) turns the 1-bit PDM stream into multibit PCM
- Analog Preprocessing: microphone preamp (adjustable gain), low-pass filter (anti-aliasing), high-pass filter (DC removal)
- Power Efficiency: AFE typically draws ~50-100 µW (the dominant consumer besides the DSP in always-on mode)
Keyword Spotting Neural Network
- DS-CNN Model: depthwise separable convolutions (cut parameters ~8-10× vs standard convolutions), typically a few stacked separable blocks, softmax output over wake-word + background classes
- Quantization: INT8 or INT4 weights (reduces model size 4-8×), maintains accuracy within 1-2%
- Feature Extraction: MFCC (mel-frequency cepstral coefficients) or log-mel spectrogram computed on-chip, typically on the DSP before NPU inference
- Training Data: keyword-specific (e.g., "Alexa", "OK Google"), negative class (silence, noise, other speech)
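The log-mel front end above can be sketched with NumPy alone. This is a minimal illustrative implementation (filterbank size, FFT length, and the `1e-6` log floor are assumptions, not a specific chip's parameters):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising slope
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(frame, sr=16000, n_mels=40):
    """Log-mel energies for one windowed audio frame."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    fb = mel_filterbank(n_mels, n_fft, sr)
    return np.log(fb @ power + 1e-6)          # floor avoids log(0)

# One 512-sample frame at 16 kHz (~32 ms) -> 40 log-mel features
rng = np.random.default_rng(0)
feats = log_mel(rng.standard_normal(512))     # shape (40,)
```

An MFCC front end would add a DCT over these log-mel energies; many sub-1 mW designs skip the DCT and feed log-mel features to the CNN directly.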
DSP + NPU Architecture
- ARM Cortex-M4/M55: main processor, audio buffer management, command dispatch
- Ethos-U55/U85: Arm's dedicated microNPU, INT8 MAC arrays, runs CNN inference in short duty-cycled bursts at <100 mW active
- Custom DSP: vendor-specific audio DSP (often a SIMD/VLIW core with 24- or 32-bit datapaths), dedicated to filtering and feature extraction
- Heterogeneous Processing: AFE on analog circuits, feature extraction on DSP, NN inference on NPU (power optimized per stage)
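The stage-gated flow above can be sketched in a few lines. The stage functions (`vad`, `features`, `classify`) are hypothetical placeholders standing in for the analog AFE, the DSP feature extractor, and the NPU classifier; the point is that costlier stages never run unless the cheaper stage before them fires:

```python
def wake_pipeline(frame, vad, features, classify, threshold=0.8):
    """Run costlier stages only when the cheaper stage before them fires."""
    if not vad(frame):                   # cheapest stage: gates everything
        return False                     # stay asleep; DSP/NPU never wake
    feats = features(frame)              # DSP stage: e.g. log-mel features
    return classify(feats) > threshold   # NPU stage: wake-word probability

# Stub stages for illustration: energy VAD, identity features, fixed score
loud = [0.3] * 160
quiet = [0.0] * 160
vad = lambda f: sum(x * x for x in f) / len(f) > 1e-4
features = lambda f: f
classify = lambda feats: 0.9
print(wake_pipeline(loud, vad, features, classify))   # True
print(wake_pipeline(quiet, vad, features, classify))  # False
```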
Commercial Always-On Solutions
- Ambiq Apollo: ultra-low-power MCU family (Cortex-M4F; newer parts Cortex-M55), sub-mW standby via Ambiq's subthreshold SPOT technology
- Nordic nRF5340: dual Cortex-M33 (application + network cores), integrated 2.4 GHz radio (BLE, Thread/Zigbee over 802.15.4), ~10 mW active
- Infineon PSoC 6: Cortex-M4 + Cortex-M0+, floating-point unit, MEMS sensor integration
- Smart Speaker SoC (Amazon, Google, Apple): full integration (microphone, AFE, DSP, NPU, RF), sealed ecosystem
Beamforming + Noise Cancellation
- Microphone Array: 2-4 microphones on device, spatial filtering to enhance desired direction
- Delay-and-Sum Beamforming: align signals from multiple mics (phase shift), sum coherently to focus on one direction
- Adaptive Filtering: least-mean-squares (LMS) or similar cancels background noise, improves wake-word detection robustness
- Power Trade-off: beamforming adds DSP complexity (10-20 mW), justified for robust far-field detection (3-5 m range)
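Delay-and-sum beamforming reduces to aligning each channel by its propagation delay and averaging. A minimal integer-delay sketch (real systems use fractional-delay filters and estimate the delays from the array geometry):

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Align each mic channel by its integer sample delay, then average.

    mics: array (n_mics, n_samples); delays_samples compensate the
    path-length difference to each microphone for the desired direction.
    Signals from that direction add coherently; off-axis noise does not.
    """
    n_mics, _ = mics.shape
    out = np.zeros(mics.shape[1])
    for ch, d in zip(mics, delays_samples):
        out += np.roll(ch, -d)   # advance delayed channels so they align
    return out / n_mics          # coherent average

# Two mics: same 440 Hz tone, second mic hears it 3 samples later
sig = np.sin(2 * np.pi * 440 * np.arange(160) / 16000)
mics = np.stack([sig, np.roll(sig, 3)])
aligned = delay_and_sum(mics, [0, 3])   # recovers sig
```

`np.roll` wraps circularly, which is fine for this toy case; a streaming implementation would buffer across frames instead.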
Far-Field Wake-Word Detection
- Acoustic Echo Cancellation (AEC): remove loudspeaker echo from microphone signals (enables simultaneous speaker output + listening)
- Noise Suppression: spectral subtraction or NN-based denoising, reduces ambient noise (fan, traffic)
- Voice Activity Detection (VAD): suppress non-speech segments before feature extraction, reduces false positives
- Range: far-field (3-5 m) vs near-field (0.5 m), far-field requires stronger preprocessing
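The VAD gate mentioned above is often just a frame-energy threshold in the cheapest always-on stage. A minimal sketch (the -40 dBFS threshold is an illustrative assumption; production VADs add hangover logic or a tiny NN):

```python
import numpy as np

def energy_vad(frames, threshold_db=-40.0):
    """Flag frames whose RMS energy exceeds a dBFS threshold.

    frames: (n_frames, frame_len) float audio in [-1, 1].
    Returns a boolean mask; False frames are skipped by later stages.
    """
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)  # avoid log(0)
    return 20.0 * np.log10(rms) > threshold_db

rng = np.random.default_rng(0)
silence = np.zeros((5, 160))                 # 10 ms frames at 16 kHz
speech = 0.1 * rng.standard_normal((5, 160)) # ~-20 dBFS "speech"
flags = energy_vad(np.vstack([silence, speech]))
```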
PDM Microphone Interface
- Pulse-Density Modulation: 1-bit output at high frequency (1-4 MHz), represents signal as pulse density
- Advantages: simple microphone circuit, no ADC in microphone, robust to noise
- PDM-to-PCM: CIC decimation filter (cascaded integrator-comb) reduces 1-bit stream to multibit PCM, computationally efficient
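The CIC structure above is just N integrators at the PDM rate, a decimator, and N combs at the low rate, with no multipliers. A floating-point sketch (the rates, R = 64 and N = 4, are illustrative choices; hardware uses fixed-point accumulators sized for the R^N gain):

```python
import numpy as np

def cic_decimate(pdm_bits, R=64, N=4):
    """N-stage CIC decimator: integrators, decimate by R, then combs.

    pdm_bits: 0/1 stream; output is multibit PCM at 1/R the input rate.
    """
    x = 2.0 * np.asarray(pdm_bits, dtype=np.float64) - 1.0  # {0,1} -> {-1,+1}
    for _ in range(N):                  # integrator cascade (full rate)
        x = np.cumsum(x)
    x = x[R - 1 :: R]                   # decimate by R
    for _ in range(N):                  # comb cascade (low rate)
        x = np.diff(x, prepend=0.0)
    return x / R ** N                   # normalize the R^N DC gain

# e.g. 3.072 MHz PDM decimated by 64 -> 48 kHz PCM (illustrative rates)
rng = np.random.default_rng(0)
pdm = (rng.random(6400) > 0.5).astype(np.uint8)
pcm = cic_decimate(pdm)                 # 100 PCM samples
```

In practice the CIC is followed by a compensation FIR and a further decimator to reach the 8-16 kHz speech rate.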
Low-Power Optimization Techniques
- Event-Driven Processing: only process when audio detected (VAD-based gating), sleep during silence
- Clock Gating: disable DSP/NPU clocks when not needed (between audio buffers)
- Dynamic Voltage/Frequency: lower frequency during silent periods (~1 MHz), boost to 50+ MHz for active recognition
- Model Compression: pruning, quantization, knowledge distillation reduce model size + inference time
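Of the compression techniques above, post-training INT8 quantization is the simplest. A symmetric per-tensor sketch (real toolchains quantize per-channel and calibrate activations too):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q, q in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = (0.1 * rng.standard_normal((64, 32))).astype(np.float32)  # fake layer weights
q, scale = quantize_int8(w)
err = np.max(np.abs(w - q.astype(np.float32) * scale))
# per-weight reconstruction error is bounded by scale / 2
```

Storage drops 4× vs FP32, and the NPU's INT8 MAC arrays (Ethos-U class) consume the quantized weights directly.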
Challenges and Trade-offs
- Privacy: local keyword spotting (no cloud upload) preferred for privacy, requires on-device neural engine
- Accuracy vs Power: more complex models improve accuracy (fewer false positives) but increase power
- Language Diversity: multilingual wake-word requires larger model or multiple models (power penalty)
Future Roadmap: wake-word detection is becoming standard in consumer devices (wearables, earbuds, smart home); multimodal (audio + visual) wake-up is emerging, and on-device processing is increasingly expected as a privacy baseline.