Definition
Adapters are small neural network modules inserted into the layers of a pretrained model, typically after the attention and feed-forward sublayers. During fine-tuning, the original model weights are frozen and only the adapter parameters are trained. This approach enables efficient transfer learning: adapting large models to new tasks while training only a small fraction of the parameters (typically 1-5% of the original model's parameter count). Multiple task-specific adapters can share the same base model, dramatically reducing storage and deployment costs.
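A minimal PyTorch sketch of this training setup, assuming adapter modules register parameters whose names contain "adapter" (an illustrative convention, not a specific library's API):

```python
import torch

def freeze_all_but_adapters(model: torch.nn.Module) -> list[torch.nn.Parameter]:
    """Freeze the pretrained weights and leave only adapter parameters trainable."""
    trainable = []
    for name, param in model.named_parameters():
        # Hypothetical naming convention: adapter parameters contain "adapter".
        param.requires_grad = "adapter" in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Usage sketch: hand only the small adapter parameter set to the optimizer.
# adapter_params = freeze_all_but_adapters(model)
# optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)
```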
Why it matters
Adapters transformed how we customize large models:
- Parameter efficiency — train 1-5% of parameters vs 100% in full fine-tuning
- Multi-task deployment — one base model + many lightweight adapters
- Reduced catastrophic forgetting — frozen base preserves original capabilities
- Lower storage costs — adapters are MBs vs GBs for full model copies
- Fast task switching — swap adapters at runtime for different tasks
Adapters enable practical multi-tenant deployment of foundation models.
How it works
┌────────────────────────────────────────────────────────────┐
│ ADAPTER ARCHITECTURE │
├────────────────────────────────────────────────────────────┤
│ │
│ BOTTLENECK ADAPTER DESIGN (Original 2019): │
│ ────────────────────────────────────────── │
│ │
│ Inserted after each Transformer sublayer: │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Transformer Block │ │
│ │ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Self-Attention (frozen) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Adapter Module │ │ │
│ │ │ (TRAINABLE) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Feed-Forward (frozen) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Adapter Module │ │ │
│ │ │ (TRAINABLE) │ │ │
│ │ └──────────────┬───────────────┘ │ │
│ │ │ │ │
│ └─────────────────┼────────────────────┘ │
│ ▼ │
│ │
│ ADAPTER MODULE INTERNALS: │
│ ───────────────────────── │
│ │
│ Input (hidden_dim = 768) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Down-project: 768 → 64 (bottleneck) │ │
│ │ [W_down: 768 × 64 = 49K params] │ │
│ └─────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Non-linearity (GELU/ReLU) │ │
│ └─────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Up-project: 64 → 768 │ │
│ │ [W_up: 64 × 768 = 49K params] │ │
│ └─────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Residual connection: output + input │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Total per adapter: ~98K params (vs millions in layer) │
│ │
│ │
│ ADAPTER FAMILY COMPARISON: │
│ ────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Method │ Where inserted │ Trainable % │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ Bottleneck │ After sublayers │ 2-5% │ │
│ │ LoRA │ Parallel to W │ 0.1-1% │ │
│ │ Prefix-tuning │ Before input │ 0.1% │ │
│ │ Prompt-tuning │ Input embeddings │ <0.1% │ │
│ │ (IA)³ │ Scale activations │ <0.1% │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ MULTI-TASK DEPLOYMENT: │
│ ────────────────────── │
│ │
│ ┌──────────────────┐ │
│ │ Frozen Base Model│ │
│ │ (7B params) │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Adapter A│ │Adapter B│ │Adapter C│ │
│ │ (Code) │ │(Medical)│ │ (Legal) │ │
│ │ ~50MB │ │ ~50MB │ │ ~50MB │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ vs. Full fine-tuning: 3 × 14GB = 42GB models needed │
│ │
└────────────────────────────────────────────────────────────┘
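The module internals above translate directly into a few lines of PyTorch. A minimal sketch of the bottleneck adapter, using the dimensions from the diagram; the near-zero initialization of the up-projection makes the adapter start as an identity map, as in the original paper:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # 768 -> 64
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # 64 -> 768
        # Near-identity start: zero the up-projection so output == input at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Matches the diagram's ballpark: ~98K weights plus ~0.8K biases.
print(sum(p.numel() for p in BottleneckAdapter().parameters()))  # 99136
```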
Adapter efficiency comparison:
| Approach | Trainable Params | Storage per Task | Training Speed |
|---|---|---|---|
| Full fine-tuning | 100% | Full model | Slow |
| Bottleneck adapters | 2-5% | ~50-200MB | Fast |
| LoRA | 0.1-1% | ~10-100MB | Very fast |
| Prompt tuning | <0.1% | ~1MB | Fastest |
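The 2-5% figure for bottleneck adapters follows from simple arithmetic. A worked example assuming a BERT-base-sized model (~110M parameters, 12 layers) with two 768→64 adapters per layer; larger bottleneck widths push the fraction toward the upper end of the range:

```python
# Parameter budget for bottleneck adapters in a BERT-base-sized model (assumed sizes).
hidden, bottleneck, layers, adapters_per_layer = 768, 64, 12, 2

per_adapter = 2 * hidden * bottleneck + bottleneck + hidden  # weights + biases
total_adapter = per_adapter * layers * adapters_per_layer

print(per_adapter)                  # 99136   (~0.1M per adapter)
print(total_adapter)                # 2379264 (~2.4M trainable)
print(total_adapter / 110_000_000)  # ~0.022 -> roughly 2% of the base model
```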
Common questions
Q: How is LoRA different from bottleneck adapters?
A: Bottleneck adapters add new layers (down-project → nonlinearity → up-project) in series. LoRA adds parallel low-rank updates to existing weight matrices. LoRA is even more parameter-efficient and has become more popular, but bottleneck adapters sometimes achieve better results on complex tasks due to their greater expressiveness.
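For contrast with the serial bottleneck module sketched earlier, a minimal LoRA-style wrapper around an existing linear layer; the rank, alpha, and initialization values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a parallel low-rank update: y = base(x) + s * x A^T B^T."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight (and bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B = 0, so BA = 0 at init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The low-rank path runs in parallel with the frozen layer, not in series after it.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```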
Q: Can I combine multiple adapters?
A: Yes! AdapterFusion and similar techniques allow combining multiple task-specific adapters. You can also stack adapters (e.g., language adapter + task adapter) for cross-lingual transfer or domain adaptation. Some frameworks support weighted combinations at inference time.
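A naive sketch of a fixed-weight combination (AdapterFusion itself learns an attention-based mixing over adapter outputs; the class and names here are illustrative):

```python
import torch
import torch.nn as nn

class WeightedAdapterMix(nn.Module):
    """Combine the residual contributions of several task adapters with fixed weights."""

    def __init__(self, adapters: list[nn.Module], weights: list[float]):
        super().__init__()
        self.adapters = nn.ModuleList(adapters)
        self.weights = weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Each adapter returns h + delta; mix only the deltas, then re-add h.
        return h + sum(w * (a(h) - h) for w, a in zip(self.weights, self.adapters))
```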
Q: Do adapters hurt inference speed?
A: Minimally. Bottleneck adapters add sequential computation, increasing latency by 2-5%. LoRA-style adapters can be merged into base weights at deployment, adding zero inference overhead. The multi-task storage savings usually outweigh the small speed cost.
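Merging works because the LoRA update is purely additive. A sketch, reusing the weight shapes from the LoRALinear example above:

```python
import torch

@torch.no_grad()
def merge_lora(weight: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               scaling: float) -> torch.Tensor:
    """Fold the low-rank update into the base weight: W' = W + s * (B @ A).

    Shapes: weight (out, in), A (rank, in), B (out, rank). After merging, the
    adapter branch can be dropped, so inference cost equals the base model's.
    """
    return weight + scaling * (B @ A)
```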
Q: When should I use adapters vs. full fine-tuning?
A: Use adapters when: (1) you have limited GPU memory, (2) you need multiple task-specific models, (3) you want to preserve base model capabilities, or (4) your fine-tuning data is limited (adapters regularize better). Use full fine-tuning when you have abundant resources and need maximum task performance.
Related terms
- LoRA — popular adapter variant using low-rank decomposition
- Fine-tuning — full-parameter training that adapters offer a lightweight alternative to
- Transfer learning — the broader paradigm that adapters make parameter-efficient
- LLM — large language models, a common target for adapter-based fine-tuning
References
Houlsby et al. (2019), “Parameter-Efficient Transfer Learning for NLP”, ICML. [Original adapter paper]
Pfeiffer et al. (2020), “AdapterHub: A Framework for Adapting Transformers”, EMNLP. [Adapter ecosystem and fusion]
He et al. (2022), “Towards a Unified View of Parameter-Efficient Transfer Learning”, ICLR. [Comparison of adapter methods]
Hu et al. (2022), “LoRA: Low-Rank Adaptation of Large Language Models”, ICLR. [LoRA as adapter alternative]