Perceiver IO

Keywords: perceiver io,foundation model

Perceiver IO extends Perceiver with flexible output decoding through output query arrays. Learned output queries cross-attend to the latent array, letting the same architecture produce structured outputs of arbitrary size and type (class labels, pixel arrays, language tokens, optical flow fields), which makes it one of the first general-purpose architectures for any-input-to-any-output deep learning tasks.

What Is Perceiver IO?

- Definition: A generalized Perceiver architecture (Jaegle et al., 2021, DeepMind) that adds a cross-attention output decoder: output query vectors, each describing what output is needed, attend to the latent array to produce structured outputs of any size and type, completing the vision of a universal input→latent→output architecture.
- What Perceiver Lacked: The original Perceiver could handle arbitrary inputs but had limited output flexibility — typically a single classification token. Perceiver IO solves this by allowing arbitrary output specifications through query arrays.
- The Generalization: Any deep learning task can be framed as: "Given input X, produce output Y" — where X and Y can be images, text, labels, flow fields, or any structured data. Perceiver IO handles all of these with the same architecture.

Architecture

| Stage | Operation | Dimensions | Purpose |
|-------|----------|-----------|---------|
| 1. Encode | Cross-attention: latent queries → input | Input: N_in × d_in → Latent: M × d | Compress input into latent bottleneck |
| 2. Process | Self-attention on latent array (L blocks) | M × d → M × d | Refine latent representations |
| 3. Decode | Cross-attention: output queries → latent | Latent: M × d → Output: N_out × d_out | Produce structured outputs |
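The three stages above can be sketched in a few lines of numpy. This is a shapes-only illustration: a real Perceiver IO uses multi-head attention, learned projections, MLP blocks, and layer norm, and all dimensions here (N_in, M, N_out, d) are arbitrary example values, not the paper's settings.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: q (Nq, d), k/v (Nk, d) -> (Nq, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
N_in, M, N_out, d = 4096, 512, 1000, 64  # illustrative sizes; M << N_in

x = rng.standard_normal((N_in, d))             # preprocessed input array
latent = rng.standard_normal((M, d))           # learned latent queries
out_queries = rng.standard_normal((N_out, d))  # learned/constructed output queries

# 1. Encode: latent queries cross-attend to the input (cost O(M * N_in))
z = attention(latent, x, x)

# 2. Process: self-attention blocks on the latent only (cost O(M^2) per block)
for _ in range(4):
    z = z + attention(z, z, z)

# 3. Decode: output queries cross-attend to the latent (cost O(N_out * M))
y = attention(out_queries, z, z)
print(y.shape)  # → (1000, 64)
```

Note that the quadratic cost never touches N_in × N_out: the input and output sizes only enter linearly, through the two cross-attention stages.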

Output Query Design

| Task | Output Queries | What They Represent | Output |
|------|---------------|-------------------|--------|
| Classification | 1 learned query vector | "What class is this?" | Class logits |
| Image Segmentation | H×W query vectors (one per pixel) | "What class is each pixel?" | Per-pixel class labels |
| Optical Flow | H×W×2 queries with position encoding | "What is the motion at each pixel?" | Per-pixel flow vectors |
| Language Modeling | Sequence of position-encoded queries | "What is the next token at each position?" | Token logits per position |
| Multimodal | Mixed queries for different output types | "Classify image AND generate caption" | Multiple heterogeneous outputs |

Why Output Queries Are Powerful

| Property | Standard Networks | Perceiver IO |
|----------|------------------|-------------|
| Output structure | Fixed by architecture (e.g., FC layer for classification) | Any size, any structure via queries |
| Multiple outputs | Need separate heads | Single decoder with different queries |
| Output resolution | Determined by network design | Determined by number of output queries |
| Cross-task architecture | Different models per task | Same model, different output queries |
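The "single decoder, different queries" row can be made concrete: the same cross-attention decoder serves classification and dense prediction just by swapping the query array. A minimal sketch, with illustrative sizes and random stand-ins for learned parameters:

```python
import numpy as np

def decode(latent, queries):
    # Shared cross-attention decoder: each query reads from the same latent.
    scores = queries @ latent.T / np.sqrt(queries.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ latent

rng = np.random.default_rng(0)
latent = rng.standard_normal((512, 64))             # processed latent array

cls_query = rng.standard_normal((1, 64))            # classification: one query
pixel_queries = rng.standard_normal((56 * 56, 64))  # segmentation: one per pixel

print(decode(latent, cls_query).shape)      # → (1, 64), project to class logits
print(decode(latent, pixel_queries).shape)  # → (3136, 64), project to per-pixel labels
```

Output resolution is set entirely by the number of queries, so the same trained model can decode at resolutions it was never configured for architecturally.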

Tasks Demonstrated with Single Architecture

| Task | Input | Output | Perceiver IO Performance |
|------|-------|--------|------------------------|
| ImageNet Classification | 224×224 image | 1 class label | 84.5% top-1 (competitive with ViT) |
| Sintel Optical Flow | 2 video frames | Per-pixel 2D flow vectors | Competitive with RAFT |
| StarCraft II | Game state | Action predictions | Near-AlphaStar performance |
| AudioSet Classification | Raw audio waveform | Sound event labels | Strong multi-label classification |
| Language Modeling | Token sequence | Next-token predictions | Competitive (but not SOTA) on text |
| Multimodal | Video + audio + text | Joint predictions | First unified multimodal architecture |

Perceiver IO vs Specialized Models

| Aspect | Specialized Models | Perceiver IO |
|--------|-------------------|-------------|
| Architecture per task | Custom (ResNet, BERT, U-Net, RAFT) | One architecture for all tasks |
| State-of-the-art | Yes (task-specific optimization) | Near-SOTA on most tasks |
| Flexibility | Limited to designed input/output types | Any input, any output |
| Development cost | High (design + optimize per task) | Low (same architecture, swap queries) |

Perceiver IO is among the most general deep learning architectures proposed to date. It extends Perceiver's modality-agnostic input encoding with flexible output-query decoding that produces arbitrary structured outputs, demonstrating that a single unchanged architecture can perform classification, segmentation, optical flow, language modeling, and multimodal tasks simply by changing the output query specification.
