LoRA (Low-Rank Adaptation)

Keywords: lora low rank adaptation, peft parameter efficient, adapter fine tuning, qlora quantized lora, fine tuning efficient

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that adapts large language models to specific tasks by training small low-rank matrices injected alongside frozen pre-trained weight matrices. It updates only 0.1-1% of the total parameters while achieving quality comparable to full fine-tuning, making it possible to fine-tune on a single GPU models that would otherwise require multi-GPU setups.

The Core Idea

Instead of updating a large weight matrix W (d × d, millions of parameters), LoRA freezes W and adds a low-rank update: W' = W + BA, where B is d×r and A is r×d, with rank r << d (typically r = 8-64). Only B and A are trained, giving d×r + r×d = 2×d×r trainable parameters instead of d² for full fine-tuning.
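
A minimal NumPy sketch of this decomposition (dimensions and initialization scales here are illustrative, not prescribed):

```python
import numpy as np

d, r = 4096, 16                   # hidden size and LoRA rank, r << d
W = np.random.randn(d, d)         # frozen pre-trained weight (never updated)
B = np.zeros((d, r))              # trainable, initialized to zero
A = np.random.randn(r, d) * 0.01  # trainable, small random init

x = np.random.randn(d)

# Mathematically y = (W + B @ A) @ x, but computing B @ (A @ x)
# avoids ever materializing the d x d product B @ A during training.
y = W @ x + B @ (A @ x)
```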

Why Low-Rank Works

The LoRA paper (Hu et al., 2021) built on evidence that the weight updates during fine-tuning have low intrinsic dimensionality: the meaningful changes live in a low-dimensional subspace of the full weight space. A rank-16 LoRA adaptation of a 4096×4096 weight matrix trains 131K parameters (2×4096×16) instead of 16.7M, a 128× reduction, while still capturing the essential task-specific adaptation.
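
The arithmetic, spelled out:

```python
d, r = 4096, 16
full = d * d          # 16,777,216 parameters updated by full fine-tuning
lora = 2 * d * r      # 131,072 parameters in B (d x r) and A (r x d)
print(full // lora)   # 128x reduction
```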

Implementation Details

- Injection Points: LoRA adapters are typically applied to the attention projection matrices (W_Q, W_K, W_V, W_O) and sometimes the FFN layers. Empirically, applying adapters to all linear layers (QKV/O plus FFN) gives the best quality.
- Initialization: A is initialized with small random Gaussian values; B is initialized to zero. This makes BA = 0 at the start of training, so the adapted layer initially behaves exactly like the pre-trained model (W + BA = W + 0 = W).
- Scaling Factor: The LoRA output is scaled by α/r, where α is a hyperparameter (often set to r or 2r). This controls the magnitude of the adaptation relative to the frozen weights.
- Merging: After training, BA can be merged into W (W_deployed = W + BA). The merged model has zero inference overhead and no additional latency compared to the original model; see the sketch after this list.
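
Pulling these details together, here is a hedged PyTorch sketch of a LoRA-wrapped linear layer; the class and method names are my own, not from any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update scaled by alpha/r."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W (and bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.02)  # Gaussian
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zeros
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha/r) * B (A x); BA = 0 at init, so behavior starts unchanged
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Fold the low-rank update into W for zero-overhead deployment
        self.base.weight += self.scaling * (self.B @ self.A)
        return self.base
```

In practice most codebases use a library such as Hugging Face PEFT rather than hand-rolling this, but the mechanics are the same.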

QLoRA (Quantized LoRA)

Combines LoRA with aggressive quantization: the base model weights are quantized to 4-bit NormalFloat (NF4) format while the LoRA adapters remain in FP16/BF16. This enables fine-tuning a 65B parameter model on a single 48 GB GPU:
- Base model: 65B params × 4 bits ≈ 32.5 GB
- LoRA adapters: ~100M params × 16 bits ≈ 200 MB
- Optimizer states (two Adam moments): ~100M params × 2 × 32 bits ≈ 800 MB
- Total: ~33-34 GB of weights and optimizer state, leaving headroom for activations on a 48 GB A6000
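
A sketch of a typical QLoRA setup using Hugging Face Transformers, bitsandbytes, and PEFT; the model ID and hyperparameters here are illustrative assumptions, not the paper's exact recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize base weights to 4-bit NF4; compute (and adapters) stay in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",            # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # reports the ~0.1-1% trainable fraction
```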

Multi-LoRA Serving

Multiple LoRA adapters (for different tasks or users) can share the same base model in memory. At inference time, the appropriate adapter is selected and applied dynamically. Frameworks such as S-LoRA and Punica serve thousands of LoRA adapters simultaneously, batching requests across different adapters with minimal overhead.
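
A toy Python illustration of the idea (real systems like S-LoRA and Punica use custom batched GPU kernels; the adapter names and weights below are made up):

```python
import torch

d, r = 1024, 16
W = torch.randn(d, d)                       # shared frozen base weight
adapters = {                                # stand-ins for trained (A, B) pairs
    name: (torch.randn(r, d) * 0.01, torch.randn(d, r) * 0.01)
    for name in ("summarize", "translate", "code")
}

def serve(x: torch.Tensor, adapter_ids: list[str]) -> torch.Tensor:
    out = x @ W.T                           # one base matmul shared by the batch
    for i, name in enumerate(adapter_ids):  # cheap per-request low-rank delta
        A, B = adapters[name]
        out[i] = out[i] + x[i] @ A.T @ B.T
    return out

batch = torch.randn(3, d)
y = serve(batch, ["summarize", "code", "summarize"])
```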

Comparison with Other PEFT Methods

| Method | Trainable Params | Inference Overhead | Quality |
|--------|-----------------|-------------------|---------|
| Full Fine-tuning | 100% | None | Best |
| LoRA (r=16) | 0.1-1% | None (merged) | Near-best |
| QLoRA | 0.1-1% | Quantization penalty | Good |
| Prefix Tuning | <0.1% | Slight (prefix tokens) | Good |
| Adapters | 1-5% | Slight (extra layers) | Good |

LoRA democratized LLM fine-tuning: it made it possible for researchers and small teams to customize billion-parameter models on consumer hardware, turning fine-tuning from a datacenter-scale operation into a single-GPU afternoon task.
