Perceiver IO

Perceiver IO extends the Perceiver architecture with flexible output decoding. A set of output queries (learned vectors or position encodings) cross-attends to the latent array, so the same architecture can produce structured outputs of arbitrary size and type: class labels, pixel arrays, language tokens, or optical flow fields. This makes it one of the first general-purpose architectures designed to map arbitrary inputs to arbitrary outputs.

What Is Perceiver IO?

Architecture

| Stage | Operation | Dimensions | Purpose |
|---|---|---|---|
| 1. Encode | Cross-attention: latent queries → input | Input: N_in × d_in → Latent: M × d | Compress input into latent bottleneck |
| 2. Process | Self-attention on latent array (L blocks) | M × d → M × d | Refine latent representations |
| 3. Decode | Cross-attention: output queries → latent | Latent: M × d → Output: N_out × d_out | Produce structured outputs |
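The three stages in the table can be traced as array shapes. The following is a minimal NumPy sketch, not the reference implementation: it assumes single-head attention with random projection weights and leaves out layer normalization, MLP blocks, and multi-head attention. The helper names (`attend`, `proj`) and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values, w_q, w_k, w_v):
    """Single-head attention: every query row reads from keys_values."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)

def proj(rows, cols):
    # Random stand-in for a learned projection matrix (assumption for this sketch).
    return rng.normal(scale=rows ** -0.5, size=(rows, cols))

N_in, d_in = 1024, 32   # input elements (e.g. pixels) and channels per element
M, d = 64, 48           # latent array: the bottleneck
N_out, d_q = 10, 16     # number of output queries and channels per query
d_out = 5               # channels per output element
L = 2                   # number of latent self-attention blocks

x = rng.normal(size=(N_in, d_in))
latent_init = rng.normal(size=(M, d))        # learned in a real model
out_queries = rng.normal(size=(N_out, d_q))  # learned or position-encoded in a real model

# 1. Encode: latent queries cross-attend to the input, N_in x d_in -> M x d.
z = attend(latent_init, x, proj(d, d), proj(d_in, d), proj(d_in, d))

# 2. Process: L self-attention blocks on the latent array, M x d -> M x d.
for _ in range(L):
    z = z + attend(z, z, proj(d, d), proj(d, d), proj(d, d))

# 3. Decode: output queries cross-attend to the latent, M x d -> N_out x d_out.
y = attend(out_queries, z, proj(d_q, d), proj(d, d), proj(d, d_out))

print(z.shape, y.shape)  # (64, 48) (10, 5)
```

Only the encode and decode steps touch N_in and N_out; the latent self-attention cost depends on M alone, which is why the bottleneck keeps large inputs and outputs tractable.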

Output Query Design

| Task | Output Queries | What They Represent | Output |
|---|---|---|---|
| Classification | 1 learned query vector | "What class is this?" | Class logits |
| Image Segmentation | H×W query vectors (one per pixel) | "What class is each pixel?" | Per-pixel class labels |
| Optical Flow | H×W position-encoded queries (one per pixel) | "What is the motion at each pixel?" | Per-pixel 2D flow vectors |
| Language Modeling | Sequence of position-encoded queries | "What is the token at each position?" | Token logits per position |
| Multimodal | Mixed queries for different output types | "Classify image AND generate caption" | Multiple heterogeneous outputs |
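The query types in the table can be sketched as plain arrays. The example below is an illustration under assumptions, not the published recipe: `fourier_position_queries` is a hypothetical helper that builds one query per pixel from sine and cosine features of its position, the classification query is a random stand-in for a learned vector, and the mixed case simply stacks both kinds of query.

```python
import numpy as np

def fourier_position_queries(height, width, num_bands=4):
    """One query per pixel, built from sin/cos features of its (y, x) position."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    grid = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (H, W, 2)
    pos = grid.reshape(-1, 2)                                     # (H*W, 2)
    freqs = 2.0 ** np.arange(num_bands)                           # (num_bands,)
    angles = np.pi * pos[:, :, None] * freqs                      # (H*W, 2, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    # Raw position plus its Fourier features: 2 + 4 * num_bands channels.
    return np.concatenate([pos, feats.reshape(len(pos), -1)], axis=-1)

rng = np.random.default_rng(0)
H, W = 4, 6

# Dense tasks (segmentation, optical flow): one position-encoded query per pixel.
pixel_queries = fourier_position_queries(H, W)
d_q = pixel_queries.shape[1]

# Classification: a single query vector (random here, learned in a real model).
class_query = rng.normal(size=(1, d_q))

# Mixed outputs: stack heterogeneous queries and decode them in one pass.
mixed_queries = np.concatenate([class_query, pixel_queries], axis=0)

print(class_query.shape, pixel_queries.shape, mixed_queries.shape)
# (1, 18) (24, 18) (25, 18)
```

Stacking works here because both query types share the same channel count; queries of different widths would first need padding or a projection to a common size.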

Why Output Queries Are Powerful

| Property | Standard Networks | Perceiver IO |
|---|---|---|
| Output structure | Fixed by architecture (e.g., FC layer for classification) | Any size, any structure via queries |
| Multiple outputs | Need separate heads | Single decoder with different queries |
| Output resolution | Determined by network design | Determined by number of output queries |
| Cross-task architecture | Different models per task | Same model, different output queries |
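The "output resolution" row can be checked directly: with fixed decoder weights, the number of output rows equals the number of queries. The sketch below assumes a single-head cross-attention decoder with random weights and a random stand-in for the latent array; `decode` and all sizes are illustrative. It also shows that each output row depends only on its own query, so a large output can be decoded in chunks.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, d_q, d_out = 32, 24, 8, 3

latent = rng.normal(size=(M, d))  # stand-in for an encoded and processed latent array

# Fixed decoder weights, shared by every query set below (random, for illustration).
w_q = rng.normal(size=(d_q, d)) / np.sqrt(d_q)
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d_out)) / np.sqrt(d)

def decode(queries):
    """Cross-attention from output queries to the latent array."""
    q, k, v = queries @ w_q, latent @ w_k, latent @ w_v
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

coarse_queries = rng.normal(size=(16, d_q))   # e.g. a 4x4 output grid
fine_queries = rng.normal(size=(256, d_q))    # e.g. a 16x16 output grid

coarse = decode(coarse_queries)
fine = decode(fine_queries)
print(coarse.shape, fine.shape)  # (16, 3) (256, 3)

# Rows are independent, so decoding in two chunks matches decoding all at once.
chunked = np.concatenate([decode(fine_queries[:100]), decode(fine_queries[100:])])
print(np.allclose(fine, chunked))
```

Because rows are independent, memory use in the decoder can be bounded by processing queries in batches, regardless of the total output size.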

Tasks Demonstrated with Single Architecture

| Task | Input | Output | Perceiver IO Performance |
|---|---|---|---|
| ImageNet Classification | 224×224 image | 1 class label | 84.5% top-1 (competitive with ViT) |
| Sintel Optical Flow | 2 video frames | Per-pixel 2D flow vectors | Competitive with RAFT |
| StarCraft II | Game state | Action predictions | Near-AlphaStar performance |
| AudioSet Classification | Raw audio waveform | Sound event labels | Strong multi-label classification |
| Language Modeling | Token sequence | Per-position token predictions | Competitive (but not SOTA) on text |
| Multimodal | Video + audio + text | Joint predictions | All modalities handled by one model |

Perceiver IO vs Specialized Models

| Aspect | Specialized Models | Perceiver IO |
|---|---|---|
| Architecture per task | Custom (ResNet, BERT, U-Net, RAFT) | One architecture for all tasks |
| State-of-the-art | Yes (task-specific optimization) | Near-SOTA on most tasks |
| Flexibility | Limited to designed input/output types | Any input, any output |
| Development cost | High (design + optimize per task) | Low (same architecture, swap queries) |

Perceiver IO is among the most general deep learning architectures proposed so far. It extends Perceiver's modality-agnostic input encoding with output-query decoding that produces arbitrary structured outputs, and it shows that one architecture can perform classification, segmentation, optical flow, language modeling, and multimodal tasks largely by changing the output query specification.
