PTQ vs. QAT: Teaching an Old Model New Tricks

LLM Quantization - This article is part of a series.

Part 1: Squeezing Brains: An Introduction to LLM Quantization

Part 2: This Article

Figure: FP16 to INT8. Visualization of the "catastrophic forgetting" that occurs when naive quantization shatters a complex model.

You just quantized a new Llama-3-70B from FP16 to INT8. Memory usage is cut in half and inference speed is off the charts compared to FP16. You feed it the exact same test prompt that the FP16 model handled flawlessly… and the model spits out pure gibberish.

Sentences collapse. Logic evaporates just like that. The model fumbles basic facts it once knew cold. Perplexity doesn’t rise a little; it explodes.

The model didn’t just lose a few decimals. It lost its mind.

This is not a rare bug or bad luck. It is what happens when you apply the simple math of quantization naively to a real LLM. The delicate balance of weights and activations that made the model smart gets shattered by rounding and clipping.

So the big question is:

How do we quantize without destroying the model’s logic?

Two rival philosophies tackle this exact problem: Post-Training Quantization (PTQ), the “fix it later” school, and Quantization-Aware Training (QAT), the “train it right” school.

By the end of this post you’ll know exactly when to reach for each technique and why one can turn a genius into an idiot while the other keeps the brilliance intact.

The Core Enemy: Clipping and Outliers

Before we get into the two camps, we need to understand clipping, the villain of the story.

Every quantized tensor lives inside a finite range \([r_{min}, r_{max}]\). Anything outside is clamped:

\(q = clamp \Big( round\Big( \dfrac{r-z}{s} \Big), q_{min}, q_{max} \Big) \)

where \(s = \dfrac{r_{max} - r_{min}}{q_{max} - q_{min}}\) is the scale and \(z\) is the zero point, expressed here in real units: the real value that maps to integer 0.
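To make the formula concrete, here is a minimal, dependency-free Python sketch. The function names `quantize` and `dequantize`, the signed 8-bit range, and the example range \([-1, 1]\) are illustrative assumptions, not part of any library.

```python
def quantize(r, r_min, r_max, q_min=-128, q_max=127):
    """Affine quantization following the formula above (z is in real units)."""
    s = (r_max - r_min) / (q_max - q_min)   # scale: real units per integer step
    z = r_min - s * q_min                   # real value that maps to integer 0
    q = round((r - z) / s)                  # note: Python rounds exact ties to even
    q = max(q_min, min(q_max, q))           # clamp into the integer range
    return q, s, z


def dequantize(q, s, z):
    """Map an integer code back to the real value it represents."""
    return s * q + z


q, s, z = quantize(0.3, r_min=-1.0, r_max=1.0)
print(q, dequantize(q, s, z))               # integer code and its reconstruction

q_clipped, _, _ = quantize(5.0, r_min=-1.0, r_max=1.0)
print(q_clipped)                            # 5.0 is outside the range, so it clamps to q_max
```

Inside the range the reconstruction error is at most half a step (\(s/2\)); outside it, the error grows with the distance from the range edge, which is the clipping error discussed in this section.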

In LLMs, the real killer is activation outliers. While 99% of the values in a transformer sit comfortably inside \(\pm 3\sigma\), a tiny fraction can spike to \(\pm 50\sigma\) or more, especially in the attention and feed-forward layers of bigger models. These spikes force the entire scale factor to stretch, crushing the common values into just a handful of coarse bins. The result is massive quantization noise exactly where the model is trying to do its most delicate work.

Figure: Why naive quantization fails on LLMs. The histogram shows a standard activation distribution with massive, distant outliers (the red spikes at ±50σ). Comparing a naive min-max calibration (which is forced to stretch the quantization grid thin) with a robust clipping strategy shows how outliers destroy precision for the bulk of the data.
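The stretching effect is easy to reproduce with synthetic data. The sketch below is an illustration under stated assumptions: the activations are standard-normal samples, two artificial outliers are placed at ±50, and the clipping range of ±4 is hand-picked rather than produced by a real calibration method.

```python
import random


def fake_quant(r, r_min, r_max, q_min=-128, q_max=127):
    """Quantize then dequantize one value on an 8-bit grid spanning [r_min, r_max]."""
    s = (r_max - r_min) / (q_max - q_min)
    z = r_min - s * q_min
    q = max(q_min, min(q_max, round((r - z) / s)))
    return s * q + z


def mean_abs_error(values, r_min, r_max):
    return sum(abs(v - fake_quant(v, r_min, r_max)) for v in values) / len(values)


random.seed(0)
bulk = [random.gauss(0.0, 1.0) for _ in range(10_000)]   # the well-behaved values
acts = bulk + [50.0, -50.0]                              # plus two synthetic outliers

# Naive min-max: the range must cover the outliers, so the grid is stretched thin.
naive = mean_abs_error(bulk, min(acts), max(acts))

# Clipped range: the outliers saturate, but the bulk gets a much finer grid.
clipped = mean_abs_error(bulk, -4.0, 4.0)

print(f"bulk error with min-max range: {naive:.4f}")
print(f"bulk error with clipped range: {clipped:.4f}")
```

The trade-off is that the two outliers are now badly clipped. Whether that is acceptable depends on how much the model relies on them, which is the problem the smarter calibration methods try to solve.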

Post-Training Quantization (PTQ): The “Fix It Later” Way

PTQ is one of the most important techniques in modern LLM quantization. You take a fully trained FP16 or FP32 model, run it on a small calibration dataset (usually 128–512 representative sequences from C4, WikiText, or your own data), and let the framework compute the quantization scales.

How it works in practice

1. Observer nodes are attached to the weight and activation tensors (PyTorch calls them MinMaxObserver, HistogramObserver, PerChannelMinMaxObserver, etc.).
2. During a forward pass over the calibration data, observers record statistics: min/max, a high percentile such as the 99.9th (to ignore extreme outliers), or an entropy criterion (KL-divergence minimisation).
3. Scales and zero-points are computed once and frozen.
4. The model is converted to real INT8 or INT4 tensors (or fake-quantized for testing).
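The observe, freeze, convert flow can be sketched without any framework. `RunningMinMax` and `to_int` below are hypothetical names for a toy min/max observer, not PyTorch's implementation, and the two short lists stand in for real calibration batches.

```python
class RunningMinMax:
    """Toy observer: tracks the running min/max of everything it is shown."""

    def __init__(self, q_min=-128, q_max=127):
        self.q_min, self.q_max = q_min, q_max
        self.r_min, self.r_max = float("inf"), float("-inf")

    def observe(self, batch):
        # Record statistics during the calibration forward pass
        self.r_min = min(self.r_min, min(batch))
        self.r_max = max(self.r_max, max(batch))

    def freeze(self):
        # Compute scale and offset once; they never change afterwards
        s = (self.r_max - self.r_min) / (self.q_max - self.q_min)
        s = s if s > 0 else 1.0            # guard against a constant tensor
        z = self.r_min - s * self.q_min    # real value that maps to integer 0
        return s, z


def to_int(r, s, z, q_min=-128, q_max=127):
    # Convert a real value to its integer code using the frozen parameters
    return max(q_min, min(q_max, round((r - z) / s)))


observer = RunningMinMax()
for batch in [[0.1, -0.4, 0.9], [1.5, -2.0, 0.3]]:   # stand-in calibration data
    observer.observe(batch)
s, z = observer.freeze()

print(to_int(-2.0, s, z), to_int(1.5, s, z), to_int(10.0, s, z))
```

The observed extremes land on the two ends of the integer range, and a value the calibration never saw (10.0 here) saturates at the top code. That saturation is why the choice of calibration data matters so much.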

Hardware reality

PTQ is inference only. No gradients, no optimizer states, no training-sized VRAM budget. With a lightweight scheme, a single RTX 4090 can quantize a 70B model to 4-bit in minutes (calibration-heavy methods such as GPTQ take longer), and generation often runs 2–3× faster in tokens/second. That is why GPTQ, AWQ, and the GGUF K-Quants are all PTQ methods.
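A quick back-of-the-envelope calculation shows where the memory savings come from. This sketch counts weight storage only; it ignores the per-group scales and zero-points that real formats store, as well as the KV cache and activations, so real footprints are somewhat larger. The helper name `weight_memory_gb` is an assumption made for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```

For a 70B-parameter model this gives roughly 140 GB at FP16, 70 GB at INT8, and 35 GB at 4-bit.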

The Catch

Because the model never saw quantization noise during training, it has no learned robustness to it. A single bad calibration run or a domain shift (chat vs. code) can destroy performance. Naive PTQ on Llama-3-70B can easily add 0.5–1.5 perplexity points at 4-bit.
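The domain-shift failure can be imitated with toy numbers. In this sketch the "chat" and "code" activations are simply Gaussian samples with different spreads, which is an assumption made purely for illustration; real activation statistics are messier. The range is frozen on the first set and then reused unchanged on the second.

```python
import random


def fake_quant(r, r_min, r_max, q_min=-128, q_max=127):
    """Quantize then dequantize one value on an 8-bit grid spanning [r_min, r_max]."""
    s = (r_max - r_min) / (q_max - q_min)
    z = r_min - s * q_min
    q = max(q_min, min(q_max, round((r - z) / s)))
    return s * q + z


def mean_abs_error(values, r_min, r_max):
    return sum(abs(v - fake_quant(v, r_min, r_max)) for v in values) / len(values)


random.seed(0)
calib = [random.gauss(0.0, 1.0) for _ in range(5_000)]     # stand-in for "chat" data
shifted = [random.gauss(0.0, 3.0) for _ in range(5_000)]   # stand-in for "code" data

r_min, r_max = min(calib), max(calib)     # range frozen after calibration

err_calib = mean_abs_error(calib, r_min, r_max)
err_shifted = mean_abs_error(shifted, r_min, r_max)
print(f"in-domain error: {err_calib:.4f}   shifted-domain error: {err_shifted:.4f}")
```

Because the frozen range no longer covers the wider distribution, a large share of the shifted values gets clipped and the average error jumps, even though nothing about the quantizer itself changed.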

Quantization-Aware Training (QAT): The “Train It Right” Way

As the name says, QAT is a whole different approach from PTQ. Instead of surprising the model with quantization at inference time, we simulate the exact quantization noise during training or fine-tuning.

Figure: The QAT "Fake Quantization" mechanism. The diagram shows the critical "Fake Quantize" node inserted into the computational graph.

The two magic components

1. Fake Quantization

In the forward pass we perform the full quantize-then-dequantize round trip, clamp included:

\( \hat{r} = s \cdot clamp \Big( round \Big( \dfrac{r - z}{s} \Big), q_{min}, q_{max} \Big) + z \)

The output \(\hat{r}\) is still in FP32 or FP16 for the next layer, but it carries the exact rounding and clipping error the real integer hardware will see.

2. Straight-Through Estimator (STE)

The rounding function has zero derivative almost everywhere, so gradients would die at every fake-quantize node. STE is the elegant hack:

\( \dfrac{\partial \mathcal{L}}{\partial r} \approx \dfrac{\partial \mathcal{L}}{\partial \hat{r}} \)

In code it is literally “identity in the backward pass.” PyTorch implements this with FakeQuantize modules.
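Both components fit in a few lines of plain Python, with the gradient written out by hand instead of using autograd. This is a toy sketch under stated assumptions: a single weight, a coarse 4-bit grid with step 0.25 and zero offset, a squared-error loss, and a target (0.75) chosen to sit exactly on the grid so the run settles. The gradient is also blocked where the weight was clipped, which is a common STE variant rather than the plain identity in the equation above.

```python
def fake_quant(w, s=0.25, q_min=-8, q_max=7):
    """Forward pass: quantize then dequantize on a coarse 4-bit grid."""
    q = max(q_min, min(q_max, round(w / s)))
    return s * q


def ste_grad(w, upstream, s=0.25, q_min=-8, q_max=7):
    """Backward pass: pass the gradient through unchanged inside the
    representable range, and block it where the value was clipped."""
    inside = s * q_min <= w <= s * q_max
    return upstream if inside else 0.0


w, x, y, lr = 0.0, 1.0, 0.75, 0.1      # latent float weight, input, target, step size
for _ in range(100):
    w_hat = fake_quant(w)                      # the loss only ever sees w_hat
    dL_dw_hat = 2.0 * (w_hat * x - y) * x      # gradient of (w_hat * x - y) ** 2
    w -= lr * ste_grad(w, dL_dw_hat)           # STE: treat d(w_hat)/dw as 1

print(w, fake_quant(w))
```

With the true derivative of `round` (zero), `w` would never leave 0.0. With STE, the latent weight drifts until its quantized value matches the target and then stops, because the loss gradient with respect to `w_hat` becomes zero.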

Because the model now sees the quantization error at every step, the optimizer gently nudges weights and activations into ranges that are quantization-friendly. Outliers are suppressed naturally, and the model learns to live with the noise. As a result, QAT typically recovers to within 0.1–0.3 perplexity of the FP16 baseline at INT8, and modern variants can push 4-bit performance extremely close to full precision.

The cost: unlike PTQ, which a single RTX 4090 can handle, QAT needs real training compute and a few epochs of fine-tuning.

Comparison

| Aspect | PTQ | QAT |
| --- | --- | --- |
| Training required | None | Yes (fine-tuning or from-scratch) |
| Calibration data | 128–512 samples | Full training/fine-tuning dataset |
| Time to quantize | 5–30 minutes | 10–100+ hours |
| Accuracy (4-bit) | Good with smart calibration | Excellent; learns to suppress outliers |
| Hardware friendliness | Perfect for inference servers | Requires full training stack |
| Typical use case | Deploying existing open models | New model development or domain adaptation |

Short Code Demos (PyTorch)

PTQ with observer nodes (static quantization)

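import torch
from torch.ao.quantization import prepare, convert, get_default_qconfig

# MyTransformer and calib_loader are placeholders for your own model and data.
# Eager-mode static quantization also expects QuantStub/DeQuantStub at the
# model's input and output so tensors enter and leave the quantized region.
model = MyTransformer()          # your FP32 model
model.eval()                     # PTQ runs in inference mode
model.qconfig = get_default_qconfig('fbgemm')    # x86 server backend
prepared_model = prepare(model, inplace=False)   # inserts observers

# Run calibration: observers record activation statistics
with torch.no_grad():
    for batch in calib_loader:   # 128–512 samples
        prepared_model(batch)

quantized_model = convert(prepared_model)        # real INT8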

QAT with fake quantization

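import torch
from torch.ao.quantization import QConfig, FakeQuantize, prepare_qat, convert
from torch.ao.quantization.observer import MinMaxObserver

# Activations: unsigned 8-bit. Weights: signed, symmetric 8-bit.
# The dtype must match the quant_min/quant_max range, or FakeQuantize rejects it.
qconfig = QConfig(
    activation=FakeQuantize.with_args(observer=MinMaxObserver, quant_min=0, quant_max=255,
                                      dtype=torch.quint8),
    weight=FakeQuantize.with_args(observer=MinMaxObserver, quant_min=-127, quant_max=127,
                                  dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)

model.train()                                   # prepare_qat expects training mode
model.qconfig = qconfig
qat_model = prepare_qat(model, inplace=False)   # inserts fake-quant nodes

# Normal training loop: the model now sees quantization noise.
# The optimizer must be built from qat_model.parameters(), not the original model.
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = qat_model(batch)                 # assumes the model returns its loss
        loss.backward()
        optimizer.step()

qat_model.eval()                                # freeze behaviour before conversion
quantized_model = convert(qat_model)            # real INT8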

Takeaway

PTQ is often considered the pragmatic choice for 99% of today’s open-source LLM deployments. It is fast, cheap, and, importantly, good enough with smart calibration.

QAT is the gold standard when you control the training process. It turns quantization from a disadvantage into an advantage by letting the model learn quantization-friendly features.

Understanding this split is the key that unlocks everything else in the quantization world. In the next post, we leave the theory behind and dive into group-wise quantization, K-Quants, and why a 70B model can finally run on your average laptop.

*Visuals generated through Google Gemini

Author

Fharook Shaik

“Learning never exhausts the mind.” - Leonardo da Vinci
