Entity Extraction and NER
What is Named Entity Recognition?
Named Entity Recognition (NER) finds the spans of text that name real-world entities and classifies each span into a predefined category such as person, organization, location, or date.
Common Entity Types
| Entity | Examples |
|--------|----------|
| PERSON | Elon Musk, Marie Curie |
| ORG | Google, United Nations |
| LOCATION | Paris, Mount Everest |
| DATE | January 1st, 2024 |
| MONEY | $100, 50 million euros |
| PRODUCT | iPhone 15, Model S |
Approaches
Traditional NER (spaCy)
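```python
import spacy

# The model must be installed first: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple CEO Tim Cook announced new products in Cupertino.")

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple: ORG
# Tim Cook: PERSON
# Cupertino: GPE
# (GPE is spaCy's label for countries, cities, and states)
```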
LLM-Based Extraction
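```python
import json

def extract_entities(text: str) -> dict:
    # `llm` is a placeholder for your LLM client; generate() is assumed
    # to return the model's reply as a string.
    result = llm.generate(f"""
Extract entities from this text in JSON format:
{{
  "persons": [],
  "organizations": [],
  "locations": [],
  "dates": []
}}
Text: {text}
""")
    # Raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(result)
```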
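Model replies are not always clean JSON: they may be wrapped in a code fence or preceded by a sentence of chatter, and keys can be missing. The sketch below is one way to parse such replies defensively. The helper name `parse_entity_json` and the fallback to empty lists are illustrative assumptions, not part of any library.

```python
import json
import re

EXPECTED_KEYS = ("persons", "organizations", "locations", "dates")

def parse_entity_json(raw: str) -> dict[str, list[str]]:
    """Parse an LLM reply into the expected entity dict, tolerating extra text."""
    empty = {key: [] for key in EXPECTED_KEYS}
    # Take everything from the first "{" to the last "}" so that code fences
    # or surrounding chatter are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    result = {}
    for key in EXPECTED_KEYS:
        value = data.get(key, [])
        if not isinstance(value, list):
            value = []
        # Keep only string items; anything else is treated as model noise.
        result[key] = [item.strip() for item in value if isinstance(item, str)]
    return result

reply = 'Sure!\n```json\n{"persons": ["Tim Cook"], "organizations": ["Apple"]}\n```'
parsed = parse_entity_json(reply)
```

Returning empty lists on failure keeps downstream code simple, but it also hides errors; logging or retrying the request may be a better fit in production.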
Structured Extraction (Instructor)
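```python
from openai import OpenAI
from pydantic import BaseModel
import instructor

class Entities(BaseModel):
    persons: list[str]
    organizations: list[str]
    locations: list[str]
    products: list[str]

# Wrap the OpenAI client so replies are parsed into the Pydantic model.
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
client = instructor.from_openai(OpenAI())

text = "Apple CEO Tim Cook announced new products in Cupertino."
entities = client.chat.completions.create(
    model="gpt-4o",
    response_model=Entities,
    messages=[{"role": "user", "content": f"Extract entities: {text}"}],
)
```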
Domain-Specific NER
Custom Entity Types
```python
# Medical
medical_entities = ["DRUG", "DISEASE", "SYMPTOM", "TREATMENT"]
# Legal
legal_entities = ["CASE", "STATUTE", "COURT", "PARTY"]
# Financial
financial_entities = ["TICKER", "COMPANY", "METRIC", "CURRENCY"]
```
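One way to put a custom type list to work is to generate the LLM extraction prompt from it, so the same code serves every domain. This is a minimal sketch; `build_extraction_prompt` is a hypothetical helper and the prompt wording is only an assumption modeled on the LLM-based example.

```python
import json

def build_extraction_prompt(entity_types: list[str], text: str) -> str:
    """Build a prompt that asks for one JSON list per entity type."""
    # Empty lists show the model the exact keys and shape expected back.
    schema = {entity_type: [] for entity_type in entity_types}
    return (
        "Extract entities from this text in JSON format:\n"
        f"{json.dumps(schema, indent=2)}\n"
        f"Text: {text}"
    )

prompt = build_extraction_prompt(
    ["DRUG", "DISEASE", "SYMPTOM", "TREATMENT"],
    "Aspirin reduces cold symptoms.",
)
print(prompt)
```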
Fine-Tuning
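Train on domain-specific data. Each example pairs a text with (start, end, label) character-offset spans, where the end offset is exclusive:
```python
# Training data format
[
    # "Aspirin" is text[0:7]; "cold symptoms" is text[16:29]
    ("Aspirin reduces cold symptoms.", {"entities": [(0, 7, "DRUG"), (16, 29, "SYMPTOM")]}),
    ...
]
```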
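Hand-counted character offsets are easy to get wrong, so it can help to compute them from the phrase and to sanity-check every example before training. The sketch below assumes the (start, end, label) format shown above with an exclusive end offset, and assumes the intended SYMPTOM span is "cold symptoms". `make_span` and `validate_example` are illustrative names, not library functions.

```python
def make_span(text: str, phrase: str, label: str) -> tuple[int, int, str]:
    """Locate the first occurrence of phrase and return (start, end, label)."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)

def validate_example(text: str, annotations: dict) -> list[str]:
    """Return a list of problems found in one training example."""
    problems = []
    previous_end = 0
    # Sort by start offset so overlaps can be detected in a single pass.
    for start, end, label in sorted(annotations["entities"]):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: span ({start}, {end}) is out of range")
            continue
        span = text[start:end]
        if span != span.strip():
            problems.append(f"{label}: span {span!r} has leading or trailing whitespace")
        if start < previous_end:
            problems.append(f"{label}: span {span!r} overlaps the previous entity")
        previous_end = max(previous_end, end)
    return problems

text = "Aspirin reduces cold symptoms."
entities = [
    make_span(text, "Aspirin", "DRUG"),
    make_span(text, "cold symptoms", "SYMPTOM"),
]
problems = validate_example(text, {"entities": entities})
```

An empty `problems` list means the spans are in range, do not overlap, and carry no stray whitespace; it does not check that the labels themselves are correct.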
Use Cases
| Use Case | Application |
|----------|-------------|
| RAG preprocessing | Extract entities for search |
| Knowledge graph | Build entity-relation triples |
| Content indexing | Categorize documents |
| Information extraction | Structured data from text |
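For the search and indexing use cases, extracted entities are often stored in an inverted index that maps each entity to the documents mentioning it. A minimal sketch follows; the document IDs and entity lists are made-up sample data, and `build_entity_index` is a hypothetical helper.

```python
from collections import defaultdict

def build_entity_index(
    doc_entities: dict[str, list[tuple[str, str]]],
) -> dict[tuple[str, str], set[str]]:
    """Map each (entity text, label) pair to the IDs of documents that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity_text, label in entities:
            # Lowercase the text so "Apple" and "apple" share one index entry.
            index[(entity_text.lower(), label)].add(doc_id)
    return dict(index)

# Illustrative input: output of any extractor, keyed by document ID.
doc_entities = {
    "doc1": [("Apple", "ORG"), ("Tim Cook", "PERSON")],
    "doc2": [("apple", "ORG"), ("Cupertino", "GPE")],
}
index = build_entity_index(doc_entities)
```

At query time, running the same extractor on the query and looking up its entities in `index` narrows retrieval to documents that mention them.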
Best Practices
- Use a traditional NER model (e.g., spaCy) when speed matters and the entity types are common ones
- Use an LLM for complex or domain-specific extraction
- Validate and normalize extracted entities before storing them
- Handle entity linking (e.g., resolve "Apple" to the specific company it refers to)
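The normalization and linking practices can be sketched with a small alias table. This is a toy illustration under stated assumptions: the canonical IDs and the `ALIASES` table are invented for the example, and real systems usually link against a knowledge base and use context to disambiguate.

```python
import re
from typing import Optional

# Hypothetical alias table: (normalized surface form, label) -> canonical ID.
ALIASES = {
    ("apple", "ORG"): "org:apple_inc",
    ("apple inc", "ORG"): "org:apple_inc",
    ("tim cook", "PERSON"): "person:tim_cook",
}

def normalize(text: str) -> str:
    """Lowercase, remove punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def link_entity(text: str, label: str) -> Optional[str]:
    """Return the canonical ID for an extracted entity, or None if unknown."""
    return ALIASES.get((normalize(text), label))

linked = link_entity("Apple Inc.", "ORG")
unknown = link_entity("Apple", "PRODUCT")
```

Keying the table on the label as well as the text keeps an ORG mention of "Apple" from being linked when the extractor tagged the span as something else.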