Definition
Adapters are small neural network modules inserted into the layers of a pretrained model, typically after the attention and feed-forward sublayers. During fine-tuning, the original model weights are frozen and only the adapter parameters are trained. This approach enables efficient transfer learning: adapting large models to new tasks while training only a small fraction of the parameters (typically 1-5% of the original model's parameter count). Multiple task-specific adapters can share the same base model, dramatically reducing storage and deployment costs.
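A minimal PyTorch sketch of this training setup, assuming adapter modules register parameters whose names contain "adapter" (an illustrative convention, not a specific library's API):

```python
import torch

def freeze_all_but_adapters(model: torch.nn.Module) -> list[torch.nn.Parameter]:
    """Freeze the pretrained weights and leave only adapter parameters trainable."""
    trainable = []
    for name, param in model.named_parameters():
        # Hypothetical naming convention: adapter parameters contain "adapter".
        param.requires_grad = "adapter" in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Usage sketch: hand only the small adapter parameter set to the optimizer.
# adapter_params = freeze_all_but_adapters(model)
# optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)
```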
Why it matters
Adapters transformed how we customize large models:
- Parameter efficiency — train 1-5% of parameters vs 100% in full fine-tuning
- Multi-task deployment — one base model + many lightweight adapters
- Reduced catastrophic forgetting — frozen base preserves original capabilities
- Lower storage costs — adapters are MBs vs GBs for full model copies
- Fast task switching — swap adapters at runtime for different tasks
Adapters enable practical multi-tenant deployment of foundation models.
How it works
┌────────────────────────────────────────────────────────────┐
│ ADAPTER ARCHITECTURE │
├────────────────────────────────────────────────────────────┤
│ │
│ BOTTLENECK ADAPTER DESIGN (Original 2019): │
│ ────────────────────────────────────────── │
│ │
│ Inserted after each Transformer sublayer: │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Transformer Block │ │
│ │ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Self-Attention (frozen) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Adapter Module │ │ │
│ │ │ (TRAINABLE) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Feed-Forward (frozen) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Adapter Module │ │ │
│ │ │ (TRAINABLE) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ └─────────────────┼────────────────────┘ │
│ ▼ │
│ │
│ ADAPTER MODULE INTERNALS: │
│ ───────────────────────── │
│ │
│ Input (hidden_dim = 768) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Down-project: 768 → 64 (bottleneck) │ │
│ │ [W_down: 768 × 64 = 49K params] │ │
│ └─────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Non-linearity (GELU/ReLU) │ │
│ └─────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Up-project: 64 → 768 │ │
│ │ [W_up: 64 × 768 = 49K params] │ │
│ └─────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Residual connection: output + input │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Total per adapter: ~98K params (vs millions in layer) │
│ │
│ │
│ ADAPTER FAMILY COMPARISON: │
│ ────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Method │ Where inserted │ Trainable % │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ Bottleneck │ After sublayers │ 2-5% │ │
│ │ LoRA │ Parallel to W │ 0.1-1% │ │
│ │ Prefix-tuning │ Before input │ 0.1% │ │
│ │ Prompt-tuning │ Input embeddings │ <0.1% │ │
│ │ (IA)³ │ Scale activations │ <0.1% │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ MULTI-TASK DEPLOYMENT: │
│ ────────────────────── │
│ │
│ ┌──────────────────┐ │
│ │ Frozen Base Model│ │
│ │ (7B params) │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Adapter A│ │Adapter B│ │Adapter C│ │
│ │ (Code) │ │(Medical)│ │ (Legal) │ │
│ │ ~50MB │ │ ~50MB │ │ ~50MB │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ vs. Full fine-tuning: 3 × 14GB = 42GB models needed │
│ │
└────────────────────────────────────────────────────────────┘
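The module internals above translate directly into a few lines of PyTorch. A minimal sketch of the bottleneck adapter, using the dimensions from the diagram; the near-zero initialization of the up-projection makes the adapter start as an identity map, as in the original paper:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # 768 -> 64
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # 64 -> 768
        # Near-identity start: zero the up-projection so output == input at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Matches the diagram's ballpark: ~98K weights plus ~0.8K biases.
print(sum(p.numel() for p in BottleneckAdapter().parameters()))  # 99136
```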
Adapter efficiency comparison:
| Approach | Trainable Params | Storage per Task | Training Speed |
|---|---|---|---|
| Full fine-tuning | 100% | Full model | Slow |
| Bottleneck adapters | 2-5% | ~50-200MB | Fast |
| LoRA | 0.1-1% | ~10-100MB | Very fast |
| Prompt tuning | <0.1% | ~1MB | Fastest |
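The 2-5% figure for bottleneck adapters follows from simple arithmetic. A worked example assuming a BERT-base-sized model (~110M parameters, 12 layers) with two 768→64 adapters per layer; larger bottleneck widths push the fraction toward the upper end of the range:

```python
# Parameter budget for bottleneck adapters in a BERT-base-sized model (assumed sizes).
hidden, bottleneck, layers, adapters_per_layer = 768, 64, 12, 2

per_adapter = 2 * hidden * bottleneck + bottleneck + hidden  # weights + biases
total_adapter = per_adapter * layers * adapters_per_layer

print(per_adapter)                  # 99136   (~0.1M per adapter)
print(total_adapter)                # 2379264 (~2.4M trainable)
print(total_adapter / 110_000_000)  # ~0.022 -> roughly 2% of the base model
```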
Common questions
Q: How is LoRA different from bottleneck adapters?
A: Bottleneck adapters add new layers (down-project → nonlinearity → up-project) in series. LoRA adds parallel low-rank updates to existing weight matrices. LoRA is even more parameter-efficient and has become more popular, but bottleneck adapters sometimes achieve better results on complex tasks due to their greater expressiveness.
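For contrast with the serial bottleneck module sketched earlier, a minimal LoRA-style wrapper around an existing linear layer; the rank, alpha, and initialization values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a parallel low-rank update: y = base(x) + s * x A^T B^T."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight (and bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B = 0, so BA = 0 at init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The low-rank path runs in parallel with the frozen layer, not in series after it.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```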
Q: Can I combine multiple adapters?
A: Yes! AdapterFusion and similar techniques allow combining multiple task-specific adapters. You can also stack adapters (e.g., language adapter + task adapter) for cross-lingual transfer or domain adaptation. Some frameworks support weighted combinations at inference time.
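A naive sketch of a fixed-weight combination (AdapterFusion itself learns an attention-based mixing over adapter outputs; the class and names here are illustrative):

```python
import torch
import torch.nn as nn

class WeightedAdapterMix(nn.Module):
    """Combine the residual contributions of several task adapters with fixed weights."""

    def __init__(self, adapters: list[nn.Module], weights: list[float]):
        super().__init__()
        self.adapters = nn.ModuleList(adapters)
        self.weights = weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Each adapter returns h + delta; mix only the deltas, then re-add h.
        return h + sum(w * (a(h) - h) for w, a in zip(self.weights, self.adapters))
```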
Q: Do adapters hurt inference speed?
A: Minimally. Bottleneck adapters add sequential computation, increasing latency by 2-5%. LoRA-style adapters can be merged into base weights at deployment, adding zero inference overhead. The multi-task storage savings usually outweigh the small speed cost.
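Merging works because the LoRA update is purely additive. A sketch, reusing the weight shapes from the LoRALinear example above:

```python
import torch

@torch.no_grad()
def merge_lora(weight: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               scaling: float) -> torch.Tensor:
    """Fold the low-rank update into the base weight: W' = W + s * (B @ A).

    Shapes: weight (out, in), A (rank, in), B (out, rank). After merging, the
    adapter branch can be dropped, so inference cost equals the base model's.
    """
    return weight + scaling * (B @ A)
```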
Q: When should I use adapters vs. full fine-tuning?
A: Use adapters when: (1) you have limited GPU memory, (2) you need multiple task-specific models, (3) you want to preserve base model capabilities, or (4) your fine-tuning data is limited (adapters regularize better). Use full fine-tuning when you have abundant resources and need maximum task performance.
Related terms
- LoRA — popular adapter variant using low-rank decomposition
- Fine-tuning — full-parameter training that adapters offer a lightweight alternative to
- Transfer learning — the broader paradigm that adapters make parameter-efficient
- LLM — large language models, a common target for adapter-based fine-tuning
References
Houlsby et al. (2019), “Parameter-Efficient Transfer Learning for NLP”, ICML. [Original adapter paper]
Pfeiffer et al. (2020), “AdapterHub: A Framework for Adapting Transformers”, EMNLP. [Adapter ecosystem and fusion]
He et al. (2022), “Towards a Unified View of Parameter-Efficient Transfer Learning”, ICLR. [Comparison of adapter methods]
Hu et al. (2022), “LoRA: Low-Rank Adaptation of Large Language Models”, ICLR. [LoRA as adapter alternative]