Methodology

Probing trains classifiers on a model's internal representations to discover what information they encode.

Methodology: Extract hidden states from the model, then train a simple classifier (a linear probe) to predict a linguistic or semantic property from them. High held-out accuracy indicates that the property is recoverable from the representation.

Probing tasks: Part-of-speech tags, syntax trees, semantic roles, coreference, factual knowledge, sentiment, and entity types.

Why linear probes: A simple classifier has little capacity to learn the task on its own, so its accuracy reflects structure already present in the representations rather than features computed by the probe.

Interpretation: Good probe accuracy does not show that the model uses that information. A property can be encoded but unused.

Control tasks: Train the same probe on randomly assigned labels to establish a baseline. Selectivity (Hewitt and Liang) is the gap between accuracy on the real task and accuracy on the control task; low selectivity suggests the probe is memorizing rather than reading out encoded structure.

Layer analysis: Probe each layer to see where features emerge and dissipate. Syntactic features often peak in middle layers and semantic features in later ones.

Beyond classification: Structural probes test the geometry of the representation space, and causal probes intervene on representations to test whether the model relies on the encoded information.

Tools: HuggingFace transformers with sklearn, or specialized probing libraries.

Limitations: Probing may find features the model doesn't use, and the linearity assumption may miss information that is encoded nonlinearly.

Applications: Understanding model internals, comparing architectures, and analyzing training dynamics. Probing is a core technique in BERTology and representation analysis.
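
The probe-plus-control-task recipe described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the "hidden states" are synthetic vectors generated with NumPy in place of real model activations, and the helper names (`make_synthetic_layer`, `probe_accuracy`, `control_labels`) are hypothetical. The control task here assigns one random label to each word type, which is one common way to build such a baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

N_CLASSES = 4  # assumed number of property values (e.g. coarse POS tags)


def make_synthetic_layer(n_tokens=2000, dim=32, n_word_types=200, signal=2.0, seed=0):
    """Synthetic stand-in for one layer's hidden states (NOT real model output)."""
    rng = np.random.default_rng(seed)
    word_ids = rng.integers(0, n_word_types, size=n_tokens)        # word type of each token
    type_to_label = rng.integers(0, N_CLASSES, size=n_word_types)  # "true" property per type
    labels = type_to_label[word_ids]
    class_dirs = rng.normal(size=(N_CLASSES, dim))                 # one direction per class
    # Each state is its class direction scaled by `signal`, plus unit Gaussian noise.
    states = signal * class_dirs[labels] + rng.normal(size=(n_tokens, dim))
    return states, labels, word_ids


def probe_accuracy(states, labels, seed=0):
    """Train a linear probe (logistic regression) and return held-out accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        states, labels, test_size=0.25, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(x_train, y_train)
    return probe.score(x_test, y_test)


def control_labels(word_ids, seed=1):
    """Control task: give every word type a random label unrelated to the real property."""
    rng = np.random.default_rng(seed)
    random_type_labels = rng.integers(0, N_CLASSES, size=word_ids.max() + 1)
    return random_type_labels[word_ids]


states, labels, word_ids = make_synthetic_layer()
task_acc = probe_accuracy(states, labels)
control_acc = probe_accuracy(states, control_labels(word_ids))
selectivity = task_acc - control_acc  # gap between real-task and control-task accuracy

print(f"task accuracy:    {task_acc:.3f}")
print(f"control accuracy: {control_acc:.3f}")
print(f"selectivity:      {selectivity:.3f}")
```

With a real model, `states` would instead hold activations from one layer, for example one entry of the `hidden_states` returned by a HuggingFace transformers model called with `output_hidden_states=True`. Repeating the same procedure for each layer gives the layer-wise picture described above.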
