Entity Extraction and NER

Keywords: entity extraction, NER, parsing

What is Named Entity Recognition?
Named Entity Recognition (NER) locates spans of text that refer to named entities and assigns each one to a predefined category such as person, organization, location, or date.

Common Entity Types
| Entity | Examples |
|--------|----------|
| PERSON | Elon Musk, Marie Curie |
| ORG | Google, United Nations |
| LOCATION | Paris, Mount Everest |
| DATE | January 1st, 2024 |
| MONEY | $100, 50 million euros |
| PRODUCT | iPhone 15, Model S |

Approaches

Traditional NER (spaCy)
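```python
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple CEO Tim Cook announced new products in Cupertino.")

# Each entity span exposes its text and a string label
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple: ORG
# Tim Cook: PERSON
# Cupertino: GPE  (spaCy tags countries, cities, and states as GPE, not LOCATION)
```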

LLM-Based Extraction
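```python
import json

def extract_entities(text: str) -> dict:
    # `llm` is a placeholder for whatever client you use; it is assumed to
    # expose generate(prompt) -> str and to be defined elsewhere.
    # Doubled braces ({{ }}) emit literal braces inside the f-string.
    result = llm.generate(f"""
Extract entities from this text in JSON format:
{{
  "persons": [],
  "organizations": [],
  "locations": [],
  "dates": []
}}

Text: {text}
""")
    # Raises json.JSONDecodeError if the reply is anything other than bare JSON
    return json.loads(result)
```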
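Calling `json.loads` directly on model output fails whenever the reply wraps the JSON in extra text or omits a key. The following is a minimal sketch of a more defensive parser, assuming the same four keys as the prompt above; `parse_entity_json` and `EXPECTED_KEYS` are illustrative names, not part of any library.

```python
import json
import re

# Keys requested in the prompt; anything else in the reply is dropped.
EXPECTED_KEYS = ("persons", "organizations", "locations", "dates")

def parse_entity_json(raw: str) -> dict[str, list[str]]:
    """Parse an LLM reply into the expected entity dict, tolerating extra text."""
    empty = {key: [] for key in EXPECTED_KEYS}
    # Grab the outermost {...} span so chatter around the JSON is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    cleaned = {}
    for key in EXPECTED_KEYS:
        value = data.get(key, [])
        # Keep only string items; a non-list value is treated as missing.
        cleaned[key] = [v for v in value if isinstance(v, str)] if isinstance(value, list) else []
    return cleaned
```

Returning empty lists on failure keeps downstream code simple; a stricter pipeline might instead raise or retry the request.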

Structured Extraction (Instructor)
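```python
from openai import OpenAI
from pydantic import BaseModel
import instructor

# Schema the response must conform to
class Entities(BaseModel):
    persons: list[str]
    organizations: list[str]
    locations: list[str]
    products: list[str]

text = "Apple CEO Tim Cook announced new products in Cupertino."

# Wrap the OpenAI client (reads OPENAI_API_KEY from the environment) so that
# replies are parsed and validated against the Entities model
client = instructor.from_openai(OpenAI())
entities = client.chat.completions.create(
    model="gpt-4o",
    response_model=Entities,
    messages=[{"role": "user", "content": f"Extract entities: {text}"}],
)
```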

Domain-Specific NER

Custom Entity Types
```python
# Medical
medical_entities = ["DRUG", "DISEASE", "SYMPTOM", "TREATMENT"]

# Legal
legal_entities = ["CASE", "STATUTE", "COURT", "PARTY"]

# Financial
financial_entities = ["TICKER", "COMPANY", "METRIC", "CURRENCY"]
```
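Before any model is trained, custom labels like these can be applied with a plain dictionary lookup. The sketch below is a simple gazetteer matcher, not a learned NER model; the `GAZETTEER` entries and the `tag_with_gazetteer` helper are hypothetical and exist only for illustration.

```python
import re

# Hypothetical gazetteer: surface form -> custom label (illustrative entries only).
GAZETTEER = {
    "aspirin": "DRUG",
    "ibuprofen": "DRUG",
    "migraine": "DISEASE",
    "nausea": "SYMPTOM",
}

def tag_with_gazetteer(text: str, gazetteer: dict[str, str]) -> list[tuple[int, int, str]]:
    """Return sorted (start, end, label) character spans for whole-word matches."""
    spans = []
    for term, label in gazetteer.items():
        # \b anchors restrict matches to whole words; matching is case-insensitive.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(tag_with_gazetteer("Aspirin can ease migraine and nausea.", GAZETTEER))
# [(0, 7, 'DRUG'), (17, 25, 'DISEASE'), (30, 36, 'SYMPTOM')]
```

A lookup like this cannot disambiguate by context or find unseen terms, which is where a fine-tuned model or an LLM becomes necessary.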

Fine-Tuning
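Fine-tune an NER model on annotated domain-specific text. Each training example pairs a sentence with its entity spans, given as (start, end, label) character offsets where the end offset is exclusive:
```python
# Training data format: (text, {"entities": [(start_char, end_char, label), ...]})
[
    # text[0:7] == "Aspirin", text[16:20] == "cold"
    ("Aspirin reduces cold symptoms.", {"entities": [(0, 7, "DRUG"), (16, 20, "SYMPTOM")]}),
    ...
]
```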
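Character offsets are easy to get wrong by one, and bad spans tend to surface only as training errors. Here is a minimal sketch of a sanity check for examples in the format above; `check_spans` is a hypothetical helper, and the three rules it applies (in bounds, no surrounding whitespace, no overlap) are common conventions rather than the requirements of any particular library.

```python
def check_spans(text, annotations):
    """Return a list of problems found in one (text, {"entities": [...]}) example."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(annotations["entities"]):
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: span ({start}, {end}) is out of bounds")
            continue
        surface = text[start:end]
        if surface != surface.strip():
            problems.append(f"{label}: span {surface!r} has leading/trailing whitespace")
        # Spans are visited in start order, so an overlap shows up as start < previous_end.
        if start < previous_end:
            problems.append(f"{label}: span {surface!r} overlaps an earlier span")
        previous_end = max(previous_end, end)
    return problems

example = ("Aspirin reduces cold symptoms.", {"entities": [(0, 7, "DRUG"), (16, 20, "SYMPTOM")]})
print(check_spans(*example))  # []
```

Running a check like this over the whole dataset before training makes annotation mistakes visible early.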

Use Cases
| Use Case | Application |
|----------|-------------|
| RAG preprocessing | Extract entities for search |
| Knowledge graph | Build entity-relation triples |
| Content indexing | Categorize documents |
| Information extraction | Structured data from text |
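As an illustration of the RAG preprocessing and content indexing rows, extracted entities can be turned into an inverted index that maps each entity to the documents mentioning it. This is a rough sketch under the assumption that an upstream extractor yields (text, label) pairs per document; `build_entity_index` and the sample `doc_entities` data are hypothetical.

```python
from collections import defaultdict

def build_entity_index(doc_entities):
    """Map each (label, casefolded text) pair to the sorted ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for text, label in entities:
            index[(label, text.casefold())].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}

# Hypothetical extractor output: doc id -> list of (entity text, label) pairs.
doc_entities = {
    "doc1": [("Apple", "ORG"), ("Tim Cook", "PERSON")],
    "doc2": [("apple", "ORG"), ("Cupertino", "GPE")],
}
index = build_entity_index(doc_entities)
print(index[("ORG", "apple")])  # ['doc1', 'doc2']
```

At query time, the same extractor can be run on the user's question and the index used to filter or boost candidate documents before retrieval.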

Best Practices
- Use traditional NER when speed matters and the entities are common types
- Use an LLM for complex or domain-specific extraction
- Validate and normalize extracted entities
- Handle entity linking (resolve "Apple" to the specific company it refers to)
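The last two practices can be sketched together: normalize each extracted string, drop duplicates, and look the result up in an alias table. This is a deliberately simple illustration; the `ALIASES` table, the canonical id scheme, and both helper functions are assumptions, and a real linker would also use context to choose between candidates.

```python
import unicodedata

# Hypothetical alias table: normalized surface form -> canonical entity id.
ALIASES = {
    "apple": "ORG:apple_inc",
    "apple inc.": "ORG:apple_inc",
    "tim cook": "PERSON:tim_cook",
}

def normalize(surface: str) -> str:
    """Unicode-normalize, collapse whitespace, and casefold an entity string."""
    text = unicodedata.normalize("NFKC", surface)
    return " ".join(text.split()).casefold()

def link_entities(surfaces, aliases=ALIASES):
    """Return deduplicated (surface, canonical id or None) pairs in first-seen order."""
    seen = set()
    linked = []
    for surface in surfaces:
        key = normalize(surface)
        # Skip empty strings and repeats of an already-seen normalized form.
        if not key or key in seen:
            continue
        seen.add(key)
        linked.append((surface.strip(), aliases.get(key)))
    return linked

print(link_entities(["Apple", "  apple ", "Apple Inc.", "Cupertino"]))
# [('Apple', 'ORG:apple_inc'), ('Apple Inc.', 'ORG:apple_inc'), ('Cupertino', None)]
```

Entities that map to `None` are unlinked and can be queued for review or passed to a more capable linker.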
