AI & Machine Learning

Quantization

Reducing model precision from 32/16-bit to 8/4-bit, dramatically decreasing memory usage and speeding up inference.

Also known as: Model quantization, Weight quantization, Low-precision inference

Definition

Quantization is a model compression technique that reduces the numerical precision of neural network weights and activations from higher-bit formats (32-bit or 16-bit floating point) to lower-bit representations (8-bit or 4-bit integers, or even binary). This dramatically reduces memory footprint and increases inference speed while typically maintaining acceptable accuracy. Modern quantization methods can compress LLMs by 4-8x relative to FP32 with minimal quality loss.

Why it matters

Quantization enables practical deployment of large models:

  • Memory reduction — 4x less memory at 8-bit and 8x at 4-bit, relative to FP32 (see the quick calculation below)
  • Faster inference — integer kernels and reduced memory traffic outpace full-precision floating point
  • Edge deployment — run LLMs on phones and embedded devices
  • Cost savings — smaller models need cheaper hardware
  • Energy efficiency — lower precision = less power consumption

Quantization makes billion-parameter models accessible on consumer hardware.
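These reduction factors follow directly from bits per parameter. Here is a quick back-of-the-envelope sketch in Python for a 7B-parameter model (matching the LLaMA-7B sizes shown in the diagram below); it counts only weight storage and ignores activations, the KV cache, and the small overhead of quantization scales:

def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB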

How it works

┌────────────────────────────────────────────────────────────┐
│                    QUANTIZATION BASICS                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  PRECISION COMPARISON:                                     │
│  ─────────────────────                                     │
│                                                            │
│  FP32 (32-bit float): ████████████████████████████████    │
│  Bits per weight: 32                                       │
│  LLaMA-7B size: 28 GB                                      │
│                                                            │
│  FP16 (16-bit float): ████████████████                    │
│  Bits per weight: 16                                       │
│  LLaMA-7B size: 14 GB                                      │
│                                                            │
│  INT8 (8-bit integer): ████████                           │
│  Bits per weight: 8                                        │
│  LLaMA-7B size: 7 GB                                       │
│                                                            │
│  INT4 (4-bit integer): ████                               │
│  Bits per weight: 4                                        │
│  LLaMA-7B size: 3.5 GB                                     │
│                                                            │
│                                                            │
│  QUANTIZATION PROCESS:                                     │
│  ─────────────────────                                     │
│                                                            │
│  Original FP32 weight: 0.12345678                         │
│                                                            │
│  1. Find range: [min_val, max_val]                        │
│     e.g., [-0.5, 0.5]                                     │
│                                                            │
│  2. Calculate scale: scale = (max - min) / (2^bits - 1)   │
│     For INT8: scale = 1.0 / 255 ≈ 0.00392                 │
│                                                            │
│  3. Quantize: q = round((val - min) / scale)              │
│     0.12345678 → round((0.12345678 + 0.5) / 0.00392)      │
│     = round(159.0) = 159                                   │
│                                                            │
│  4. Store as integer: 159 (uses only 8 bits)              │
│                                                            │
│  5. Dequantize: val = q × scale + min                     │
│     159 × 0.00392 - 0.5 = 0.12328 ≈ 0.12345678 ✓         │
│                                                            │
│                                                            │
│  QUANTIZATION TYPES:                                       │
│  ───────────────────                                       │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐ │
│  │ Post-Training Quantization (PTQ)                     │ │
│  │ ────────────────────────────────                     │ │
│  │ • Quantize AFTER training (no retraining needed)     │ │
│  │ • Fast, simple, works well for 8-bit                 │ │
│  │ • May lose accuracy at 4-bit                         │ │
│  │                                                       │ │
│  │ Trained Model ─────► Calibration ─────► Quantized    │ │
│  │                      (sample data)                    │ │
│  └──────────────────────────────────────────────────────┘ │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐ │
│  │ Quantization-Aware Training (QAT)                    │ │
│  │ ─────────────────────────────────                    │ │
│  │ • Simulate quantization DURING training              │ │
│  │ • Model learns to be robust to quantization          │ │
│  │ • Better accuracy, but requires training             │ │
│  │                                                       │ │
│  │ Training with ──► Fake Quantize ──► Real Quantize   │ │
│  │ simulated low prec.  (gradients)     (deployment)    │ │
│  └──────────────────────────────────────────────────────┘ │
│                                                            │
│                                                            │
│  POPULAR QUANTIZATION METHODS:                             │
│  ─────────────────────────────                             │
│                                                            │
│  LLM.int8()   - 8-bit with outlier handling               │
│  GPTQ         - 4-bit with layer-wise calibration         │
│  AWQ          - 4-bit activation-aware quantization       │
│  NF4          - 4-bit optimized for normal distribution   │
│  GGML/GGUF   - CPU-optimized quantization formats         │
│                                                            │
│  ACCURACY vs COMPRESSION TRADEOFF:                         │
│  ─────────────────────────────────                         │
│                                                            │
│  Precision │ Size │ Speed │ Quality Loss                  │
│  ──────────┼──────┼───────┼──────────────                 │
│  FP16      │ 1x   │ 1x    │ ~0%                           │
│  INT8      │ 0.5x │ 2-3x  │ ~0-1%                         │
│  INT4      │ 0.25x│ 3-4x  │ ~1-3%                         │
│  INT2      │ 0.125x│ 4-5x │ ~5-10%+                       │
│                                                            │
└────────────────────────────────────────────────────────────┘
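The five-step process in the diagram maps directly onto a few lines of NumPy. This is a minimal sketch of asymmetric (min-max) quantization of a single tensor; production libraries add per-channel or per-group scales, explicit zero points, and outlier handling (as in LLM.int8()):

import numpy as np

def quantize(weights: np.ndarray, bits: int = 8):
    """Asymmetric min-max quantization (bits <= 8): map floats to integers in [0, 2^bits - 1]."""
    w_min, w_max = float(weights.min()), float(weights.max())   # 1. find range
    scale = (w_max - w_min) / (2**bits - 1)                     # 2. step size
    q = np.round((weights - w_min) / scale).astype(np.uint8)    # 3.-4. quantize, store as int
    return q, scale, w_min

def dequantize(q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """5. Reconstruct approximate float values from the stored integers."""
    return q.astype(np.float32) * scale + w_min

w = np.random.uniform(-0.5, 0.5, size=(4, 4)).astype(np.float32)
q, scale, w_min = quantize(w)
w_hat = dequantize(q, scale, w_min)
print("max reconstruction error:", np.abs(w - w_hat).max())  # bounded by scale/2, ~0.002 for INT8

The "fake quantize" step used in quantization-aware training can be sketched the same way. The version below assumes PyTorch and uses the straight-through estimator so that gradients pass through the non-differentiable rounding; the function name is illustrative:

import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """QAT-style fake quantization: forward pass sees quantize->dequantize,
    backward pass treats the op as identity (straight-through estimator)."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (2**bits - 1)
    q = torch.round((w - w_min) / scale).clamp(0, 2**bits - 1)
    w_q = q * scale + w_min
    return w + (w_q - w).detach()  # value of w_q, gradient of w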

Common questions

Q: Does quantization always hurt model quality?

A: For 8-bit quantization, quality loss is typically negligible (<0.5%). For 4-bit, modern methods like GPTQ and AWQ achieve 1-3% degradation on benchmarks. The impact varies by task—factual recall may suffer more than general fluency. Always benchmark on your specific use case.

Q: Which quantization method should I use?

A: For serving on GPUs, GPTQ or AWQ are popular choices. For CPU inference or consumer devices, GGUF format works well. For fine-tuning, NF4 (used in QLoRA) preserves trainability. Each has tradeoffs between speed, quality, and compatibility.
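As a concrete illustration of the NF4 path, here is a sketch of loading a causal LM in 4-bit, assuming the Hugging Face transformers + bitsandbytes stack (the model ID is a placeholder; parameter names reflect recent library versions and may change):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4, suited to normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 after on-the-fly dequantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_id = "your-org/your-model"  # placeholder: any causal LM on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)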

Q: Can I quantize any model?

A: Most modern transformers quantize well. Smaller models (< 7B parameters) may suffer more from aggressive quantization. Very old architectures without layer normalization can be problematic. When in doubt, benchmark your specific model.

Q: What’s mixed-precision quantization?

A: Different parts of the model use different precisions. For example, keeping attention layers in higher precision while quantizing MLPs more aggressively. This can give better quality/speed tradeoffs than uniform quantization.
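A minimal, library-agnostic sketch of the idea: choose a bit-width per module based on its name. The module names and bit-widths here are hypothetical, not taken from any particular library or model:

PRECISION_MAP = {
    "self_attn": 8,   # keep attention projections at 8-bit
    "mlp": 4,         # quantize feed-forward layers more aggressively
    "lm_head": 16,    # leave the output head in higher precision
}

def bits_for(module_name: str, default_bits: int = 8) -> int:
    """Pick a bit-width for a module by matching substrings of its name."""
    for pattern, bits in PRECISION_MAP.items():
        if pattern in module_name:
            return bits
    return default_bits

print(bits_for("model.layers.0.mlp.down_proj"))     # 4
print(bits_for("model.layers.0.self_attn.q_proj"))  # 8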

Related terms

  • QLoRA — quantization combined with LoRA for efficient fine-tuning
  • Model compression — broader category including quantization
  • Inference — where quantization benefits are realized
  • LLM — models commonly quantized for deployment

References

Dettmers et al. (2022), “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”, NeurIPS. [Foundational 8-bit quantization]

Frantar et al. (2023), “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”, ICLR. [Popular 4-bit method]

Lin et al. (2023), “AWQ: Activation-aware Weight Quantization”, arXiv. [Advanced 4-bit quantization]

Jacob et al. (2018), “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, CVPR. [Foundational quantization techniques]