Vision-Language Models (CLIP, LLaVA, Flamingo, GPT-4V)


Vision-Language Models (CLIP, LLaVA, Flamingo, GPT-4V) are a class of multimodal AI systems that jointly process visual and textual information, enabling tasks such as image captioning, visual question answering, zero-shot image classification, and open-ended visual reasoning. They represent the convergence of computer vision and natural language processing into unified architectures.

CLIP: Contrastive Language-Image Pre-training

- Architecture: Dual-encoder model with a vision encoder (ViT-L/14 or ResNet) and a text encoder (Transformer) trained to align image and text representations in a shared embedding space
- Training: Contrastive learning on 400 million image-text pairs from the internet; each image-text pair is a positive, all other combinations in the batch are negatives
- Zero-shot classification: Classify images by comparing image embeddings to text embeddings of class descriptions (e.g., "a photo of a dog"); no task-specific training is required (see the sketch after this list)
- Transfer breadth: Strong zero-shot performance across 30+ vision benchmarks; competitive with supervised ResNet-50 on ImageNet without seeing any ImageNet training data
- Limitations: Struggles with fine-grained spatial reasoning, counting, attribute binding, and compositional understanding
- SigLIP: Sigmoid loss variant replacing softmax-based contrastive loss, enabling more flexible batch construction and improved performance
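
A minimal zero-shot classification sketch using the Hugging Face transformers CLIP wrappers; the checkpoint name, image path, and label prompts below are illustrative assumptions, not details from the text above:

```python
# Zero-shot classification with CLIP via Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")  # placeholder path to any RGB image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature;
# softmax over the candidate labels gives zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```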

LLaVA: Large Language and Vision Assistant

- Architecture: Connects a pretrained CLIP vision encoder to a pretrained LLM (Vicuna/LLaMA) via a trainable linear projection layer
- Training pipeline: (1) Feature alignment pretraining on 558K image-caption pairs (train projection only), (2) Visual instruction tuning on 150K GPT-4-generated visual conversations (train projection + LLM)
- Visual instruction tuning: GPT-4 generates diverse question-answer pairs about images, creating instruction-following data for visual reasoning
- LLaVA-1.5: Improves on the original with an MLP projection (instead of linear), a higher-resolution vision encoder (224→336 pixels), and academic-task-oriented training data (see the projector sketch after this list)
- LLaVA-NeXT: Dynamic high-resolution processing via image slicing (AnyRes), improved OCR and document understanding
- Cost efficiency: Full LLaVA training costs ~$100 in compute (single 8-GPU node, 1 day), democratizing VLM research
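
A rough sketch of the LLaVA-style connector described above, assuming a frozen CLIP ViT that emits 1024-dimensional patch features and an LLM with a 4096-dimensional embedding space; the dimensions and layer sizes are illustrative, not the exact LLaVA configuration:

```python
# LLaVA-style connector: project frozen vision features into the LLM's token space.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA used a single linear layer; LLaVA-1.5 uses a small MLP like this.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the frozen CLIP ViT
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected "visual tokens" are concatenated with embedded text tokens and
# fed to the LLM as one sequence (576 patches = 24x24 grid at 336px, patch size 14).
visual = VisionProjector()(torch.randn(1, 576, 1024))
text = torch.randn(1, 32, 4096)              # stand-in for embedded text tokens
llm_input = torch.cat([visual, text], dim=1)  # (1, 608, 4096)
```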

Flamingo and Few-Shot Visual Learning

- Perceiver Resampler: Converts variable-length visual features into a fixed number of visual tokens (64 tokens per image) via cross-attention with learned latent queries (sketched after this list)
- Interleaved attention: Gated cross-attention layers inserted between frozen LLM layers allow visual information to condition text generation without modifying the base LLM
- Few-shot capability: Achieves strong performance with just 4-32 image-text examples in context; no gradient updates required
- Multi-image understanding: Natively processes sequences of interleaved images and text, enabling video understanding and multi-image reasoning
- Frozen LLM: The base language model (a 70B-parameter Chinchilla in the largest Flamingo) remains frozen; only the cross-attention and Perceiver Resampler parameters are trained
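
A simplified Perceiver-Resampler-style module in PyTorch: a fixed set of learned latent queries cross-attends to a variable number of visual features and always returns 64 visual tokens. The dimensions, head count, and single attention block are illustrative simplifications, not Flamingo's exact configuration:

```python
# Learned latent queries compress variable-length visual features into 64 tokens.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches_or_frames, dim); the length may vary
        b = visual_feats.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)   # (batch, 64, dim)
        attended, _ = self.cross_attn(queries, visual_feats, visual_feats)
        return attended + self.ff(attended)                      # always (batch, 64, dim)

tokens = PerceiverResampler()(torch.randn(2, 729, 1024))  # variable-length visual input
print(tokens.shape)  # torch.Size([2, 64, 1024])
```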

GPT-4V and Commercial Multimodal Systems

- Capabilities: Processes images, charts, documents, screenshots, handwriting, and diagrams with sophisticated reasoning and detailed descriptions
- Spatial reasoning: Improved understanding of spatial relationships, object counting, and visual grounding compared to earlier VLMs
- OCR and document understanding: Reads text in images including tables, receipts, code screenshots, and mathematical notation
- Safety measures: Refuses to identify real people, generate harmful content, or process certain sensitive image categories
- GPT-4o (Omni): Natively multimodal (image, audio, video, text) trained end-to-end rather than composing separate vision and language modules
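
For context, a hedged sketch of querying a hosted multimodal model through the OpenAI Python SDK (v1-style chat interface); the model name, prompt, and image URL are placeholders, and the exact API surface may change over time:

```python
# Sending an image plus a text question to a hosted multimodal model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show, and what trend stands out?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```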

Architectural Approaches and Design Choices

- Encoder-decoder fusion: Cross-attention between visual and text features (Flamingo, BLIP-2)
- Early fusion: Treat image patches as tokens concatenated with text tokens in a single transformer (Fuyu, Gemini)
- Late fusion: Separate encoders with alignment in embedding space (CLIP, SigLIP)
- Q-Former: BLIP-2's lightweight querying transformer that bridges frozen vision encoder and frozen LLM with 188M trainable parameters
- Resolution handling: Dynamic tiling (LLaVA-NeXT), multi-scale features (InternVL), or native high-resolution encoders (PaLI-X at 756×756); see the tiling sketch after this list
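
A simplified illustration of dynamic-tiling-style resolution handling: the high-resolution image is cut into encoder-sized crops plus a downscaled global overview, and each piece is encoded separately. The tile size and grid logic here are assumptions for illustration, not LLaVA-NeXT's exact AnyRes implementation:

```python
# Split a high-resolution image into fixed-size tiles plus a low-res overview.
from PIL import Image

def tile_image(img: Image.Image, tile: int = 336):
    cols = -(-img.width // tile)    # ceiling division
    rows = -(-img.height // tile)
    resized = img.resize((cols * tile, rows * tile))
    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    overview = img.resize((tile, tile))  # downscaled view for global context
    return [overview] + tiles            # each crop is passed through the vision encoder

pieces = tile_image(Image.open("document_page.png"))  # placeholder path
print(len(pieces), "crops of size", pieces[0].size)
```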

Evaluation and Benchmarks

- Visual QA: VQAv2, OK-VQA (outside knowledge), TextVQA (reading text in images); a sketch of the standard VQA scoring rule follows this list
- Holistic evaluation: MMBench, SEED-Bench, and MM-Vet test diverse capabilities including OCR, spatial reasoning, and knowledge
- Hallucination: POPE and CHAIR benchmarks measure how often VLMs hallucinate objects not present in the image
- Document understanding: DocVQA, ChartQA, and InfographicVQA evaluate structured visual understanding
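
As a concrete example of how the VQA benchmarks above are scored, the commonly used VQA accuracy rule gives a predicted answer min(matches/3, 1) credit, where matches counts how many of the ten human annotators gave that same answer. The sketch below is a simplification that omits the official answer normalization (punctuation, articles, number words) and the averaging over annotator subsets:

```python
# Simplified VQA-style accuracy: full credit if at least 3 annotators agree.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    matches = sum(a.strip().lower() == prediction.strip().lower()
                  for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("two", ["two", "2", "two", "two", "three",
                           "two", "two", "2", "two", "two"]))  # 1.0
```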

Vision-language models have rapidly evolved from zero-shot classifiers to general-purpose visual reasoning engines, with open models like LLaVA closing the gap to commercial systems and enabling accessible multimodal AI research and applications across science, education, and industry.
