nlpaug

Keywords: nlpaug,text,augmentation

nlpaug is a Python library designed for augmenting text data in NLP pipelines. It provides character-level (typo simulation, keyboard and OCR errors), word-level (synonym replacement via WordNet or word embeddings, contextual replacement using BERT, random insertion/deletion/swap), and sentence-level (back-translation, contextual sentence insertion using GPT-2, abstractive summarization) augmentation techniques that generate diverse synthetic training examples to reduce overfitting and improve model robustness on text classification, named entity recognition, and other NLP tasks.

What Is nlpaug?

- Definition: An open-source Python library (pip install nlpaug) that provides a unified API for augmenting text data at three granularity levels — character, word, and sentence — using rule-based, embedding-based, and transformer-based approaches.
- Why Text Augmentation?: Unlike images (flip, rotate, crop), text augmentation is harder — changing a word can change meaning entirely. nlpaug provides linguistically aware augmentation that preserves semantic meaning while creating lexical diversity.
- The Problem It Solves: NLP models overfit on small datasets because they memorize exact word sequences. Augmentation forces models to generalize beyond the specific words used in training examples.
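
To make the core idea concrete, here is a minimal plain-Python sketch (not nlpaug's implementation) of two classic word-level operations, EDA-style random swap and random deletion. All function names are illustrative; a fixed seed is used only for reproducibility.

```python
import random

def random_swap(words, n_swaps=1, rng=None):
    """Swap two random word positions n_swaps times (EDA-style)."""
    rng = rng or random.Random(0)
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_delete(words, p=0.2, rng=None):
    """Drop each word with probability p, keeping at least one word."""
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

tokens = "the quick brown fox jumps".split()
print(random_swap(tokens))
print(random_delete(tokens))
```

Even these crude perturbations force a classifier to stop memorizing exact word order; libraries like nlpaug add the linguistically informed variants on top.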

Three Augmentation Levels

| Level | Technique | Example | Preserves Meaning? |
|-------|-----------|---------|-------------------|
| Character | Keyboard error | "hello" → "heklo" | Mostly (simulates typos) |
| Character | OCR error | "hello" → "he11o" | Mostly (simulates scan errors) |
| Character | Random insert/delete | "hello" → "helllo" | Mostly |
| Word | Synonym (WordNet) | "The quick fox" → "The fast fox" | Yes |
| Word | Word embedding (Word2Vec) | "happy" → "joyful" | Yes |
| Word | TF-IDF based | Replace low-TF-IDF words | Yes |
| Word | Contextual (BERT) | "The [MASK] fox" → "The brown fox" | Usually |
| Word | Random swap | "I love cats" → "love I cats" | Partial |
| Sentence | Back-translation | "I love cats" → "J'adore les chats" → "I adore cats" | Yes |
| Sentence | Contextual insertion (GPT-2) | Insert a model-generated sentence | Usually |
| Sentence | Abstractive summarization | Rephrase entire sentence | Yes |
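
The character-level rows rely on a keyboard-layout adjacency table. The sketch below shows the principle in plain Python; the adjacency map is a tiny, hypothetical subset of a real QWERTY table, not nlpaug's internal data.

```python
import random

# Tiny illustrative subset of a QWERTY adjacency map (a real augmenter
# ships a full per-key layout table).
QWERTY_NEIGHBORS = {
    'a': 'qwsz', 'e': 'wsdr', 'h': 'gjyn', 'l': 'kop',
    'o': 'ipl', 'r': 'edft', 's': 'awdz', 't': 'rfgy',
}

def keyboard_typo(text, p=0.1, rng=None):
    """Replace each character, with probability p, by a keyboard neighbor."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < p:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return ''.join(out)

print(keyboard_typo("hello world", p=0.3))
```

Because each substituted character is a physical neighbor of the original key, the result looks like a plausible human typo rather than random noise.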

Code Examples

```python
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.char as nac

# Note: in recent nlpaug versions (>= 1.1.0), augment() returns a list of
# augmented strings; the outputs below are illustrative and vary per run.

# Synonym replacement (WordNet)
aug = naw.SynonymAug(aug_src='wordnet')
aug.augment("The quick brown fox jumps over the lazy dog.")
# e.g. ['The fast brown fox leaps over the lazy dog.']

# Contextual word replacement (BERT)
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action='substitute'
)
aug.augment("The weather is nice today.")
# e.g. ['The weather is pleasant today.']

# Character-level keyboard errors
aug = nac.KeyboardAug()
aug.augment("Machine learning is powerful.")
# e.g. ['Machone learning is powerfyl.']
```
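
In practice, augmenters are used to expand a small labeled dataset, keeping each example's label on its augmented copies. The sketch below shows that workflow in plain Python with a hypothetical miniature synonym table (nlpaug would draw synonyms from WordNet or embeddings instead); the function names are illustrative.

```python
import random

# Hypothetical miniature synonym table for illustration only.
SYNONYMS = {"quick": ["fast", "swift"], "happy": ["glad", "joyful"]}

def synonym_augment(text, rng=None):
    """Replace each word that has a table entry with a random synonym."""
    rng = rng or random.Random(0)
    words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
             for w in text.split()]
    return ' '.join(words)

def expand_dataset(examples, n_aug=2, rng=None):
    """Return the originals plus n_aug augmented copies of each (label kept)."""
    rng = rng or random.Random(0)
    out = list(examples)
    for text, label in examples:
        out += [(synonym_augment(text, rng), label) for _ in range(n_aug)]
    return out

data = [("the quick fox", "animal"), ("a happy day", "mood")]
print(expand_dataset(data))
```

With n_aug=2 the dataset triples in size while every augmented copy inherits its source label, which is exactly how augmentation is wired into a classification training loop.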

nlpaug vs Alternatives

| Library | Strengths | Limitations |
|---------|-----------|-------------|
| nlpaug | Unified API, three levels, transformer support | Slower for BERT-based augmentation |
| TextAttack | Adversarial examples + augmentation | More complex API |
| EDA (Easy Data Augmentation) | Dead simple, 4 operations | No embedding/transformer support |
| AugLy (Meta) | Multi-modal (text + image + audio) | Heavier dependency |
| Custom Back-Translation | Highest quality paraphrases | Requires translation API/model |

When to Use nlpaug

| Scenario | Recommended Augmenter | Why |
|----------|---------------------|-----|
| Small dataset (<1K examples) | Synonym + Back-translation | Maximum diversity with meaning preservation |
| Typo robustness | Character-level keyboard aug | Train model to handle real-world typos |
| Text classification | Word-level synonym + contextual | Diverse lexical variation |
| NER / Token classification | Character-level only | Word-level changes can shift entity boundaries |
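
The NER caution in the last row can be shown in a few lines of plain Python (this is an illustration, not nlpaug code): a one-to-many synonym swap breaks the one-token-per-tag alignment that BIO labeling assumes. The example sentence and paraphrase are hypothetical.

```python
# Token classification pairs each token with exactly one BIO tag.
tokens = ["Barack", "Obama", "visited", "Paris"]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC"]
assert len(tokens) == len(tags)

# Replace "visited" with a multi-word paraphrase (hypothetical synonym):
augmented = ["Barack", "Obama", "paid", "a", "visit", "to", "Paris"]

# The original tag list no longer lines up token-for-token.
print(len(augmented), len(tags))  # prints: 7 4
```

Character-level typos, by contrast, perturb letters inside a token, so the token count and the tag alignment survive, which is why the table recommends them for NER.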

nlpaug is a standard Python library for NLP data augmentation. It provides a clean, unified API across character-, word-, and sentence-level techniques, generating linguistically diverse training examples; transformer-based contextual augmentation (BERT, GPT-2) typically produces the highest-quality synthetic text for improving model robustness on small NLP datasets.
