Dosage Extraction is the clinical NLP subtask of identifying and parsing dosage information (amounts, units, routes, frequencies, and dosing schedules) in medication-related clinical text. It supports medication reconciliation, pharmacovigilance, pharmacoepidemiology research, and clinical decision support, all of which need precise quantitative medication data rather than drug name recognition alone.
What Is Dosage Extraction?
- Scope: The numeric and qualitative attributes that define how a medication is administered.
- Components: Strength (500mg), Unit (mg / mcg / mg/kg), Form (tablet / capsule / injection), Route (oral / IV / SC), Frequency (once daily / BID / q8h / PRN), Duration (7 days / 6 weeks / indefinite), Timing modifiers (with meals / at bedtime / on empty stomach).
- Benchmark Context: A sub-component of the 2009 i2b2 Medication Extraction Challenge and of n2c2 2018 Track 2 (adverse drug events and medication extraction); also evaluated in SemEval clinical NLP tasks.
- Normalization: Convert extracted dosage expressions to standardized units — "1 tab" → "500mg" (if tablet strength known); "once daily" → frequency code QD → interval 24h.
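The normalization step can be sketched as a small lookup plus a pattern for "qNh" forms. This is a minimal illustration under stated assumptions: `FREQ_TO_INTERVAL_HOURS`, `normalize_frequency`, and `tablets_to_mg` are hypothetical names, and the abbreviation table is deliberately incomplete rather than a standard vocabulary.

```python
import re
from typing import Optional

# Illustrative table only (assumption); real systems need a site-specific mapping.
FREQ_TO_INTERVAL_HOURS = {
    "qd": 24.0, "daily": 24.0, "once daily": 24.0,
    "bid": 12.0, "twice daily": 12.0,
    "tid": 8.0,
    "qid": 6.0,
}


def normalize_frequency(text: str) -> Optional[float]:
    """Return the nominal dosing interval in hours, or None if unrecognized."""
    key = text.strip().lower()
    if key in FREQ_TO_INTERVAL_HOURS:
        return FREQ_TO_INTERVAL_HOURS[key]
    # Handle explicit interval forms such as "q8h" or "q12h".
    match = re.fullmatch(r"q(\d+(?:\.\d+)?)h", key)
    return float(match.group(1)) if match else None


def tablets_to_mg(n_tablets: float, strength_mg: Optional[float]) -> Optional[float]:
    """Convert a tablet count to milligrams only when the tablet strength is known."""
    if strength_mg is None:
        return None  # refuse to guess a strength
    return n_tablets * strength_mg
```

Returning `None` for unknown inputs (for example "PRN", which has no fixed interval) keeps the unresolved cases visible instead of silently assigning a default.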
Dosage Expression Diversity
Clinical text expresses dosage in extraordinarily varied ways:
Standard Expressions:
- "Metoprolol succinate 25mg PO QAM" — straightforward.
- "Lisinopril 10mg by mouth daily" — spelled out route and frequency.
Abbreviation-Heavy:
- "ASA 81mg po qd" — aspirin, 81mg, oral, once daily.
- "Vancomycin 1.5g IVPB q12h x14d" — antibiotic, intravenous piggyback, every 12 hours for 14 days.
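For compact orders like the two above, a rule-based pass is one possible starting point. The sketch below assumes a fixed field order (drug, amount, unit, route, frequency, optional duration) and a very small vocabulary; `ORDER_PATTERN` and `parse_order` are hypothetical names, and spelled-out forms such as "by mouth" are intentionally not covered.

```python
import re
from typing import Dict, Optional

# Assumed field order: drug, amount+unit, route, frequency, optional "xNd" duration.
ORDER_PATTERN = re.compile(
    r"""
    (?P<drug>[A-Za-z][A-Za-z ]*?)\s+
    (?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mcg|mg|g|units?)\s+
    (?P<route>ivpb|iv|po|sc|im)\s+
    (?P<freq>q\d+h|qd|bid|tid|qid|qam|qhs|daily)
    (?:\s+x\s*(?P<duration>\d+)\s*d)?
    \s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_order(text: str) -> Optional[Dict[str, object]]:
    """Split a compact medication order into fields; None if it does not fit."""
    match = ORDER_PATTERN.match(text.strip())
    if match is None:
        return None
    duration = match.group("duration")
    return {
        "drug": match.group("drug"),
        "amount": float(match.group("amount")),
        "unit": match.group("unit").lower(),
        "route": match.group("route").upper(),
        "frequency": match.group("freq").lower(),
        "duration_days": int(duration) if duration else None,
    }


print(parse_order("Vancomycin 1.5g IVPB q12h x14d"))
print(parse_order("ASA 81mg po qd"))
```

Patterns like this are brittle by design: anything outside the assumed grammar returns `None`, which is where learned sequence taggers or a broader lexicon would take over.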
Weight-Based Pediatric Dosing:
- "Amoxicillin 40mg/kg/day div q8h" — dose rate + weight factor + division schedule.
- Parsing requires knowing patient weight from elsewhere in the record.
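The arithmetic behind "40mg/kg/day div q8h" can be made explicit: the daily total is the rate times body weight, and the division schedule splits that total into 24/interval doses. The helper name `per_dose_mg` and the 15 kg example weight are assumptions for illustration only.

```python
def per_dose_mg(rate_mg_per_kg_per_day: float, weight_kg: float, interval_hours: float) -> float:
    """Resolve a weight-based daily rate into a single-dose amount in mg."""
    if weight_kg <= 0 or interval_hours <= 0:
        raise ValueError("weight and interval must be positive")
    daily_total_mg = rate_mg_per_kg_per_day * weight_kg
    doses_per_day = 24.0 / interval_hours  # q8h gives 3 doses per day
    return daily_total_mg / doses_per_day


# Hypothetical 15 kg patient: 40 * 15 = 600 mg/day, split into 3 doses.
print(per_dose_mg(40, 15, 8))  # 200.0
```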
Titration Schedules:
- "Start methotrexate 7.5mg weekly, increase to 15mg after 4 weeks if tolerated" — sequential dosing with conditional escalation.
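A titration instruction does not fit a single (dose, frequency) pair, so one option is to represent it as an ordered list of steps. The sketch below is one possible data structure, not an established schema: `DoseStep`, `active_step`, and the free-text `condition` label are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class DoseStep:
    dose_mg: float
    frequency: str
    start_week: int                  # earliest week (from therapy start) for this step
    condition: Optional[str] = None  # unresolved free-text condition, if any


def active_step(
    steps: List[DoseStep],
    week: int,
    conditions_met: FrozenSet[str] = frozenset(),
) -> Optional[DoseStep]:
    """Return the latest step that has started and whose condition is satisfied."""
    eligible = [
        s for s in steps
        if s.start_week <= week and (s.condition is None or s.condition in conditions_met)
    ]
    return max(eligible, key=lambda s: s.start_week) if eligible else None


# "Start methotrexate 7.5mg weekly, increase to 15mg after 4 weeks if tolerated"
methotrexate = [
    DoseStep(7.5, "weekly", start_week=0),
    DoseStep(15.0, "weekly", start_week=4, condition="tolerated"),
]
```

Keeping the condition as an explicit field means the escalation is never applied unless some other component confirms that it was met.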
Conditional and Range Dosing:
- "Insulin lispro 4-8 units SC per sliding scale": a dose range whose administered value depends on the measured glucose level.
- "Hold if HR<60" — conditional hold modifying the base dosing instruction.
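Both examples above can be captured as structured values rather than a single number. The following sketch assumes simple surface forms ("N-M units", "Hold if X<N"); `RANGE_PATTERN`, `HOLD_PATTERN`, `parse_dose_range`, `parse_hold_rule`, and `should_hold` are hypothetical names.

```python
import operator
import re
from typing import Optional, Tuple

RANGE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(units?|mcg|mg)", re.IGNORECASE
)
HOLD_PATTERN = re.compile(
    r"hold\s+if\s+([A-Za-z]+)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)", re.IGNORECASE
)
COMPARATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def parse_dose_range(text: str) -> Optional[Tuple[float, float, str]]:
    """Extract (low, high, unit) from an expression like '4-8 units'."""
    m = RANGE_PATTERN.search(text)
    return (float(m.group(1)), float(m.group(2)), m.group(3).lower()) if m else None


def parse_hold_rule(text: str) -> Optional[Tuple[str, str, float]]:
    """Extract (parameter, comparator, threshold) from 'Hold if HR<60'."""
    m = HOLD_PATTERN.search(text)
    return (m.group(1).upper(), m.group(2), float(m.group(3))) if m else None


def should_hold(rule: Tuple[str, str, float], observed: float) -> bool:
    """Evaluate a parsed hold rule against an observed value."""
    _, comparator, threshold = rule
    return COMPARATORS[comparator](observed, threshold)


rule = parse_hold_rule("Hold if HR<60")
print(parse_dose_range("Insulin lispro 4-8 units SC per sliding scale"), rule)
```

Evaluating the rule is kept separate from parsing it, since the observed value (heart rate, glucose) comes from a different part of the record.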
Why Dosage Extraction Is Hard
- Unit Ambiguity: "5ml" of amoxicillin suspension vs. "5ml" of IV saline look identical on the page, but a volume alone does not give the amount of active drug; that depends on the product and its concentration.
- Implicit Frequency: "Continue home medications" — frequency implied but not stated.
- Abbreviated Medical Jargon: Clinical dosage abbreviations are not standardized across institutions — "QD" vs. "once daily" vs. "OD" vs. "1x/day."
- Mathematical Expressions: "0.5mg/kg twice daily" requires linking to patient weight from a different document section.
- Cross-Reference Dependency: "Same dose as prior admission" — requires retrieval from prior clinical notes.
Performance Results
| Attribute | i2b2 2009 Best System F1 |
|-----------|------------------------|
| Drug name | 93.4% |
| Dosage (amount + unit) | 88.7% |
| Route | 91.2% |
| Frequency | 85.3% |
| Duration | 72.1% |
| Reason/Indication | 68.4% |
Duration and indication are consistently the hardest attributes — they are most often implicit or require semantic inference.
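For readers unfamiliar with the metric in the table, F1 is the harmonic mean of precision and recall. The sketch below assumes exact-match scoring over sets of (start, end, label) tuples, which may differ from the official challenge scoring; `f1_score` and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, attribute label)


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match F1 between gold and predicted annotation sets."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up spans for illustration: two of three predictions match the gold set.
gold = {(0, 10, "drug"), (11, 15, "dosage"), (16, 19, "frequency")}
pred = {(0, 10, "drug"), (11, 15, "dosage"), (20, 26, "duration")}
print(round(f1_score(gold, pred), 3))  # 0.667
```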
Clinical Importance
- Overdose Prevention: Extracting "acetaminophen 1000mg q4h" reveals a total of 6g/day (six 1000mg doses), above the commonly cited 4g/day adult maximum; the same check matters when a patient takes acetaminophen in multiple formulations.
- Renal Dosing Compliance: Verifying that renally cleared drugs (vancomycin, metformin, digoxin) are dose-adjusted to the patient's extracted eGFR.
- Pharmacokinetic Studies: Precise dose time-series extraction from clinical notes enables population PK modeling using real-world dosing data.
- Clinical Trial Eligibility: Trials often require specific dosage history ("on stable metformin ≥1g/day for ≥3 months") — automatic extraction makes this eligibility check scalable.
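The overdose check above reduces to simple arithmetic once dose and interval are extracted. As a rough sketch, assuming every order has already been normalized to (dose in mg, interval in hours) for the same ingredient: `daily_total_mg`, `exceeds_daily_max`, and the 4000 mg ceiling are illustrative assumptions, not clinical guidance.

```python
from typing import Iterable, Tuple

# Assumed ceiling for illustration only; actual limits depend on drug and patient.
ASSUMED_MAX_DAILY_MG = 4000.0


def daily_total_mg(orders: Iterable[Tuple[float, float]]) -> float:
    """Sum the 24-hour exposure across orders given as (dose_mg, interval_hours)."""
    return sum(dose_mg * (24.0 / interval_hours) for dose_mg, interval_hours in orders)


def exceeds_daily_max(
    orders: Iterable[Tuple[float, float]],
    max_daily_mg: float = ASSUMED_MAX_DAILY_MG,
) -> bool:
    """Flag a regimen whose summed daily dose is above the configured ceiling."""
    return daily_total_mg(orders) > max_daily_mg


# "acetaminophen 1000mg q4h": 6 doses per day, 6000 mg total.
print(daily_total_mg([(1000.0, 4.0)]))     # 6000.0
print(exceeds_daily_max([(1000.0, 4.0)]))  # True
```

Summing over a list of orders rather than a single order is what lets the check catch a patient who receives the same ingredient through several products.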
Dosage Extraction is the pharmacometric precision layer of clinical NLP. It moves beyond drug name recognition to extract the complete quantitative dosing profile that clinical safety systems, pharmacovigilance algorithms, and medication reconciliation tools need to protect patients from dosing errors and harmful drug regimens.