LoRA (Low-Rank Adaptation)

Keywords: lora low rank adaptation, parameter efficient fine tuning peft, lora adapter training, qlora quantized lora, lora rank alpha

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that adapts a large pre-trained model to new tasks by injecting small, trainable low-rank decomposition matrices into Transformer layers. The original weights are frozen entirely; only the injected matrices, typically 0.1-1% of the total parameters, are trained. This achieves fine-tuning quality comparable to full-parameter training at a fraction of the memory and compute cost.

The Low-Rank Hypothesis

Full fine-tuning updates every parameter in the model, but research shows that the weight changes (delta-W) during fine-tuning occupy a low-dimensional subspace. LoRA exploits this: instead of updating a dƗd weight matrix W directly, it learns a low-rank decomposition delta-W = BA, where B is dƗr and A is rƗd, with rank r ≪ d (typically 8-64). This cuts trainable parameters for that matrix from d² to 2dr, a massive compression.
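To make the compression concrete, here is a small arithmetic sketch in Python; the hidden size d = 4096 is an assumed example value, not tied to any particular model.

```python
# Worked example: trainable-parameter savings for one square weight matrix.
d = 4096   # assumed hidden size (hypothetical example)
r = 8      # LoRA rank

full = d * d       # parameters updated by full fine-tuning of W
lora = 2 * d * r   # parameters in B (d x r) plus A (r x d)

print(f"full fine-tuning: {full:,} params")    # 16,777,216
print(f"LoRA (r=8):       {lora:,} params")    # 65,536
print(f"fraction trained: {lora / full:.2%}")  # 0.39%
```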

How LoRA Works

1. Freeze: All original model weights W are frozen (no gradients are computed for them).
2. Inject: For selected weight matrices (typically the query and value projections in attention, plus the up/down projections in the MLP), add a parallel low-rank branch: output = Wx + BAx.
3. Train: Only the matrices A and B are trained. A is initialized with random Gaussian values; B is initialized to zero, so the initial delta-W = BA is zero and the pre-trained model is preserved exactly at the start of training.
4. Merge: After training, the learned delta-W = BA can be merged into the original weights: W_new = W + BA. The merged model has zero additional inference latency. A minimal sketch of all four steps follows this list.
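The four steps above can be condensed into a short, self-contained PyTorch sketch. This is an illustrative implementation under our own naming (LoRALinear, merge); it is not the reference code from the LoRA paper or the peft library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear (a sketch, not reference code)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # step 1: freeze W (and bias)
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # steps 2-3: inject the low-rank branch; A ~ Gaussian, B = 0,
        # so the initial delta-W = BA is exactly zero.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r  # the alpha/r scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = Wx + (alpha/r) * B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # step 4: fold delta-W = (alpha/r) * BA into W; zero extra inference latency
        self.base.weight += self.scale * (self.B @ self.A)
        return self.base

# Example: wrap a 4096x4096 projection with a rank-8 adapter.
layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
```

Because B starts at zero, the wrapped layer reproduces the base model exactly before training; optimization then moves only A and B.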

Key Hyperparameters

- Rank (r): Controls the capacity of the adaptation. r=8 works for most tasks; complex domain shifts may need r=32-64. Higher rank means more trainable parameters but rarely improves quality beyond a point.
- Alpha (α): A scaling factor applied to the LoRA branch: delta-W = (α/r) BA. A common heuristic is α = 2r. This controls the magnitude of the adaptation relative to the original weights.
- Target Modules: Which weight matrices receive LoRA adapters. Applying LoRA to all linear layers (attention Q/K/V/O plus the MLP projections) generally gives the best quality but increases the parameter count. A typical configuration is sketched below.
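In practice these hyperparameters are usually set through a configuration object. The sketch below uses the Hugging Face peft library; the checkpoint name is a placeholder, and the q_proj/v_proj module names follow LLaMA-style models (other architectures use different names).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder checkpoint

config = LoraConfig(
    r=8,                                  # rank of the decomposition
    lora_alpha=16,                        # scaling is lora_alpha / r = 2
    target_modules=["q_proj", "v_proj"],  # LLaMA-style attention Q/V projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports the small trainable fraction
```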

QLoRA

Quantized LoRA loads the frozen base model in 4-bit quantization (the NF4 data type) while training the LoRA adapters in higher precision (typically BF16). This enables fine-tuning a 65B-parameter model on a single 48GB GPU, a task that would otherwise require 4-8 GPUs with full fine-tuning. A minimal loading recipe is sketched below.
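This recipe assumes the Hugging Face transformers and peft libraries with the bitsandbytes backend; the checkpoint name is again a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "base-model", quantization_config=bnb_config
)
# LoRA adapters are added on top in higher precision and trained as usual.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```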

Practical Advantages

- Multi-Tenant Serving: One base model serves multiple tasks by hot-swapping different LoRA adapters (each only ~10-100 MB), so a single GPU can serve dozens of specialized variants; see the sketch after this list.
- Composability: Multiple LoRA adapters trained for different capabilities (coding, medical, creative writing) can be merged or interpolated.
- Training Speed: Typically 2-3x faster than full fine-tuning, because far fewer gradients are computed and the optimizer states are much smaller.
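A sketch of adapter hot-swapping with peft; the checkpoint, adapter paths, and adapter names here are hypothetical.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder checkpoint

# Load one adapter, then attach more by name; each adapter is only a few MB on disk.
model = PeftModel.from_pretrained(base, "adapters/coding", adapter_name="coding")
model.load_adapter("adapters/medical", adapter_name="medical")

model.set_adapter("medical")  # route requests through the medical variant
model.set_adapter("coding")   # hot-swap back without reloading the base model
```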

LoRA is the technique that made LLM customization accessible to everyone — enabling fine-tuning of billion-parameter models on consumer hardware while preserving the full quality of the pre-trained foundation.
