Home Knowledge Base RAG, Retrieval, and Knowledge Bases

RAG, Retrieval, and Knowledge Bases

A comprehensive technical guide with mathematical foundations


1. Overview

RAG (Retrieval-Augmented Generation) is an architecture that enhances Large Language Models (LLMs) by grounding their responses in external knowledge sources.

Core Components


2. Mathematical Foundations

2.1 Vector Embeddings

Documents and queries are converted to dense vectors in $\mathbb{R}^d$ where $d$ is the embedding dimension (typically 384, 768, or 1536).

Embedding Function:

$$ E: \text{Text} \rightarrow \mathbb{R}^d $$

For a document $D$ and query $Q$:

$$ \vec{d} = E(D) \in \mathbb{R}^d $$

$$ \vec{q} = E(Q) \in \mathbb{R}^d $$

2.2 Similarity Metrics

Cosine Similarity

$$ \text{sim}_{\cos}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \cdot \lVert \vec{d} \rVert} = \frac{\sum_{i=1}^{d} q_i \cdot d_i}{\sqrt{\sum_{i=1}^{d} q_i^2} \cdot \sqrt{\sum_{i=1}^{d} d_i^2}} $$

Euclidean Distance (L2)

$$ \text{dist}_{L2}(\vec{q}, \vec{d}) = \lVert \vec{q} - \vec{d} \rVert_2 = \sqrt{\sum_{i=1}^{d} (q_i - d_i)^2} $$

Dot Product

$$ \text{sim}_{\text{dot}}(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{d} q_i \cdot d_i $$

2.3 BM25 (Sparse Retrieval)

$$ \text{BM25}(Q, D) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$

Where:

Inverse Document Frequency (IDF):

$$ \text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right) $$

Where:


3. RAG Pipeline Architecture

3.1 Pipeline Stages

1. Indexing Phase

2. Query Phase

3.2 Retrieval Formula

Given a query $Q$ and corpus $\mathcal{C}$, retrieve top-$k$ documents:

$$ \mathcal{D}_k = \{D_1, D_2, ..., D_k\} \quad \text{where} \quad \text{sim}(Q, D_1) \geq \text{sim}(Q, D_2) \geq ... \geq \text{sim}(Q, D_k) $$

3.3 Generation with Context

$$ P(\text{Response} | Q, \mathcal{D}_k) = \text{LLM}(Q \oplus \mathcal{D}_k) $$

Where $\oplus$ denotes context concatenation.


4. Chunking Strategies

4.1 Fixed-Size Chunking

$$ \text{Number of chunks} = \left\lceil \frac{\lvert D \rvert - o}{c - o} \right\rceil $$

4.2 Semantic Chunking

$$ \text{Split at } i \quad \text{if} \quad \text{sim}(s_i, s_{i+1}) < \theta $$

4.3 Recursive Chunking


5. Knowledge Base Design

5.1 Metadata Schema

{
  "chunk_id": "string",
  "document_id": "string",
  "content": "string",
  "embedding": "vector[d]",
  "metadata": {
    "source": "string",
    "title": "string",
    "author": "string",
    "date_created": "ISO8601",
    "date_modified": "ISO8601",
    "section": "string",
    "page_number": "integer",
    "chunk_index": "integer",
    "total_chunks": "integer",
    "tags": ["string"],
    "confidence_score": "float"
  }
}

5.2 Index Types

HNSW Search Complexity:

$$ O(d \cdot \log n) $$

Where $d$ is embedding dimension and $n$ is corpus size.


6. Evaluation Metrics

6.1 Retrieval Metrics

Recall@k

$$ \text{Recall@}k = \frac{\lvert \text{Relevant} \cap \text{Retrieved@}k \rvert}{\lvert \text{Relevant} \rvert} $$

Precision@k

$$ \text{Precision@}k = \frac{\lvert \text{Relevant} \cap \text{Retrieved@}k \rvert}{k} $$

Mean Reciprocal Rank (MRR)

$$ \text{MRR} = \frac{1}{\lvert Q \rvert} \sum_{i=1}^{\lvert Q \rvert} \frac{1}{\text{rank}_i} $$

Normalized Discounted Cumulative Gain (NDCG)

$$ \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)} $$

$$ \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k} $$

6.2 Generation Metrics

$$ G = \frac{\lvert \text{Claims supported by context} \rvert}{\lvert \text{Total claims} \rvert} $$


7. Advanced Techniques

7.1 Hybrid Search

Combine dense and sparse retrieval:

$$ \text{score}_{\text{hybrid}} = \alpha \cdot \text{score}_{\text{dense}} + (1 - \alpha) \cdot \text{score}_{\text{sparse}} $$

Where $\alpha \in [0, 1]$ is the weighting parameter.

7.2 Reranking

Apply cross-encoder reranking to top-$k$ results:

$$ \text{score}_{\text{rerank}}(Q, D) = \text{CrossEncoder}(Q, D) $$

Cross-encoder complexity: $O(k \cdot \lvert Q \rvert \cdot \lvert D \rvert)$

7.3 Query Expansion

$$ \vec{q}_{\text{HyDE}} = E(\text{LLM}(Q)) $$

$$ \mathcal{D}_{\text{merged}} = \bigcup_{i=1}^{m} \text{Retrieve}(Q_i) $$

7.4 Contextual Compression

Reduce retrieved context before generation:

$$ C_{\text{compressed}} = \text{Compress}(\mathcal{D}_k, Q) $$


8. Vector Database Options

DatabaseIndex TypesHostingScalability
PineconeHNSW, IVFCloudHigh
WeaviateHNSWSelf/CloudHigh
QdrantHNSWSelf/CloudHigh
MilvusIVF, HNSWSelf/CloudVery High
FAISSFlat, IVF, HNSWSelfMedium
ChromaHNSWSelfLow-Medium
pgvectorIVFFlat, HNSWSelfMedium

9. Best Practices Checklist


10. Code Examples

10.1 Cosine Similarity (Python)

import numpy as np

def cosine_similarity(vec_q: np.ndarray, vec_d: np.ndarray) -> float:
    """
    Calculate cosine similarity between two vectors.

    $$\text{sim}_{\cos}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \cdot \lVert \vec{d} \rVert}$$
    """
    dot_product = np.dot(vec_q, vec_d)
    norm_q = np.linalg.norm(vec_q)
    norm_d = np.linalg.norm(vec_d)
    return dot_product / (norm_q * norm_d)

10.2 BM25 Implementation

import math
from collections import Counter

def bm25_score(
    query_terms: list[str],
    document: list[str],
    corpus: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75
) -> float:
    """
    Calculate BM25 score for a query-document pair.
    """
    doc_len = len(document)
    avg_doc_len = sum(len(d) for d in corpus) / len(corpus)
    doc_freq = Counter(document)
    N = len(corpus)

    score = 0.0
    for term in query_terms:
**Document frequency**
        n_q = sum(1 for d in corpus if term in d)

**IDF calculation**
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

**Term frequency in document**
        f_q = doc_freq.get(term, 0)

**BM25 term score**
        numerator = f_q * (k1 + 1)
        denominator = f_q + k1 * (1 - b + b * (doc_len / avg_doc_len))

        score += idf * (numerator / denominator)

    return score

References

1. Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" 2. Robertson, S., & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond" 3. Johnson, J., et al. (2019). "Billion-scale similarity search with GPUs" (FAISS) 4. Malkov, Y., & Yashunin, D. (2018). "Efficient and robust approximate nearest neighbor search using HNSW"


Document generated for VS Code with KaTeX/LaTeX math support. Render with Markdown Preview Enhanced or similar extension.

ragretrievalknowledge base

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.