Definition
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes the original model weights and injects trainable low-rank decomposition matrices into selected layers. Instead of updating billions of parameters, LoRA trains two small matrices (A and B) whose product approximates the weight update. This can reduce the number of trainable parameters by roughly 10,000x and GPU memory use by about 3x while achieving performance comparable to full fine-tuning.
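In code, the idea can be sketched as a drop-in linear layer. The snippet below is a minimal illustration assuming PyTorch; the class name LoRALinear, the initialization scales, and the dimensions are chosen for illustration rather than taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update A @ B, scaled by alpha / r."""

    def __init__(self, d_in: int, d_out: int, r: int = 16, alpha: int = 16):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)                   # freeze the original weight
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)    # trainable down-projection (d_in x r)
        self.B = nn.Parameter(torch.zeros(r, d_out))          # trainable up-projection, zero-init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = W x + (alpha / r) * (A B) x
        return self.W(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,} trainable parameters")  # 131,072 (~130K) vs 16.7M frozen weights
```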
Why it matters
LoRA revolutionized LLM customization:
- Dramatic efficiency — fine-tune models with tens of billions of parameters on a single GPU
- Storage savings — adapters are just megabytes vs gigabytes for full models
- Easy switching — swap LoRA adapters for different tasks at runtime
- No inference overhead — merge adapters into base weights
- Foundation for hosting — enables multi-tenant model serving
LoRA democratized access to custom LLMs by making fine-tuning affordable.
How it works
┌────────────────────────────────────────────────────────────┐
│ LORA ARCHITECTURE │
├────────────────────────────────────────────────────────────┤
│ │
│ FULL FINE-TUNING (Traditional): │
│ ─────────────────────────────── │
│ │
│ Input ──► [W + ΔW] ──► Output │
│ ▲ │
│ │ Update ALL weights │
│ │ (billions of parameters) │
│ │ Memory: ~100+ GB │
│ │
│ LORA (Efficient): │
│ ───────────────── │
│ │
│ Input ──┬──► [W frozen] ────┬──► Output │
│ │ │ │
│ └──► [A × B] ───────┘ │
│ ▲ │
│ │ Train only A and B │
│ │ (millions of parameters) │
│ │ Memory: ~1-10 GB │
│ │
│ LOW-RANK DECOMPOSITION: │
│ ─────────────────────── │
│ │
│ Original weight update ΔW ≈ A × B │
│ │
│ W: [4096 × 4096] = 16.7M params (frozen) │
│ ↓ │
│ A: [4096 × 16] = 65K params (trainable) │
│ B: [16 × 4096] = 65K params (trainable) │
│ ↓ │
│ Total trainable: 130K vs 16.7M = 0.78% of original │
│ │
│ RANK (r) = 16 is the "bottleneck dimension" │
│ │
│ │
│ DIMENSION VISUALIZATION: │
│ ──────────────────────── │
│ │
│ Full ΔW: LoRA Approximation: │
│ ┌─────────┐ ┌──┐ │
│ │ │ │ │ ┌─────────┐ │
│ │ 4096 │ = │A │ × │ B │ │
│ │ × │ │ │ │ 16×4096 │ │
│ │ 4096 │ │4096│ └─────────┘ │
│ │ │ │×16│ │
│ └─────────┘ └──┘ │
│ 16.7M params 130K params total │
│ │
│ INFERENCE TIME: │
│ ─────────────── │
│ │
│ Option 1: Keep separate (switch adapters) │
│ Output = W×x + (A×B)×x │
│ │
│ Option 2: Merge (no overhead) │
│ W_merged = W + A×B │
│ Output = W_merged × x │
│ │
│ TYPICAL HYPERPARAMETERS: │
│ ──────────────────────── │
│ • Rank (r): 8-64 (higher = more capacity) │
│ • Alpha (α): scaling factor, often α = r │
│ • Target modules: q_proj, v_proj (attention layers) │
│ • Learning rate: 1e-4 to 3e-4 │
│ │
└────────────────────────────────────────────────────────────┘
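The two inference options in the diagram can be checked numerically. Below is a small sketch assuming PyTorch, with shrunken dimensions and random matrices standing in for trained weights; it shows that merging gives the same output as keeping the adapter separate.

```python
import torch

d, r, alpha = 512, 16, 16          # dimensions shrunk for a quick check
W = torch.randn(d, d)              # frozen base weight
A = torch.randn(d, r) * 0.01       # stand-ins for trained LoRA factors
B = torch.randn(r, d) * 0.01
x = torch.randn(d)
scale = alpha / r

# Option 1: keep the adapter separate (easy to swap per task)
y_separate = W @ x + scale * ((A @ B) @ x)

# Option 2: merge once, then serve with no extra latency
W_merged = W + scale * (A @ B)
y_merged = W_merged @ x

print(torch.allclose(y_separate, y_merged, atol=1e-4))  # True: both paths agree
```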
LoRA efficiency comparison:
| Metric | Full fine-tune | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| GPU memory (7B model) | ~60 GB+ | ~16 GB (less with QLoRA) |
| Training speed | Baseline | Faster (fewer gradients and optimizer states) |
| Checkpoint/adapter size | Full model (GBs) | Adapter only (MBs, up to ~100 MB) |
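In practice these savings come from a configuration like the one below, using the Hugging Face PEFT library referenced at the end. The model name is illustrative, and the exact trainable percentage depends on the model and the chosen target modules.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base model name is illustrative; any causal LM supported by transformers works.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank: the bottleneck dimension
    lora_alpha=16,                        # scaling factor, often set equal to r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```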
Common questions
Q: How do I choose the right rank (r)?
A: Start with r=8 or r=16 for most tasks. Higher ranks (32, 64) capture more complex adaptations but need more memory and risk overfitting on small datasets. For simple tasks (style adaptation), r=4 may suffice. For complex domain knowledge, try r=64+. Experiment and compare validation loss.
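A quick way to see the memory cost of a higher rank: for a 4096-wide weight matrix, as in the example above, each adapted matrix adds 2 × 4096 × r trainable parameters.

```python
# Trainable parameters added per adapted 4096x4096 matrix: A (4096 x r) + B (r x 4096)
d = 4096
for r in (4, 8, 16, 32, 64):
    print(f"r={r:<2} -> {2 * d * r:,} trainable params per adapted matrix")
# r=4  -> 32,768
# r=8  -> 65,536
# r=16 -> 131,072
# r=32 -> 262,144
# r=64 -> 524,288
```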
Q: Which layers should I apply LoRA to?
A: Typically query (q_proj) and value (v_proj) projections in attention layers. Research shows these capture most task-specific information. You can add key (k_proj), output (o_proj), and MLP layers for more capacity. More layers = more trainable parameters = more memory.
Q: Can I combine multiple LoRA adapters?
A: Yes! You can add, subtract, or interpolate LoRA adapters. This enables combining skills (e.g., coding + German language), or interpolating between styles. Some frameworks support loading multiple adapters with different weights at inference time.
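Conceptually, combining adapters is a weighted sum of their low-rank updates. The sketch below assumes PyTorch and uses random stand-ins for trained factors; the 0.7/0.3 weights and the "coding"/"German" labels are placeholders.

```python
import torch

d, r = 4096, 16
# Random stand-ins for two trained adapters on the same frozen weight W
A1, B1 = torch.randn(d, r), torch.randn(r, d)   # e.g. a "coding" adapter
A2, B2 = torch.randn(d, r), torch.randn(r, d)   # e.g. a "German" adapter

# Combining skills is a weighted sum of the low-rank updates
w1, w2 = 0.7, 0.3
delta_W = w1 * (A1 @ B1) + w2 * (A2 @ B2)

# The combined update merges into the base weight like a single adapter
W = torch.randn(d, d)
W_combined = W + delta_W
```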
Q: How does LoRA compare to full fine-tuning quality?
A: For most tasks, LoRA achieves 90-100% of full fine-tuning performance. On some complex tasks requiring deeper changes to the model's behavior, full fine-tuning may still win. But the efficiency gains (orders of magnitude fewer trainable parameters and far less GPU memory) make LoRA the default choice for most applications.
Related terms
- Fine-tuning — traditional approach LoRA improves
- QLoRA — LoRA with quantization for even more efficiency
- Adapter — broader category of efficient tuning methods
- LLM — models that benefit from LoRA
References
Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models”, ICLR 2022. [Original LoRA paper]
Dettmers et al. (2023), “QLoRA: Efficient Finetuning of Quantized LLMs”, NeurIPS. [LoRA + quantization]
Lialin et al. (2023), “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning”, arXiv. [Survey of efficient fine-tuning methods]
Hugging Face (2023), “PEFT: Parameter-Efficient Fine-Tuning”, GitHub. [Popular LoRA implementation]