quantization is weird

some notes on INT8 quantization edge cases from the past few months. nothing here is novel research. it's just stuff i ran into that wasn't obvious from the textbook description.

the outlier problem

standard post-training quantization works like this: you run a calibration dataset through the FP16 model, observe the activation ranges, and pick scale/zero-point values that map the FP16 range into INT8 ([-128, 127]). works great most of the time.
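the mapping above can be sketched in a few lines. this is a toy version of asymmetric MinMax calibration, not any particular framework's implementation; the function names are mine:

```python
# minimal sketch of asymmetric per-tensor post-training quantization:
# observe the FP range, derive scale/zero-point, then round into int8
import numpy as np

def calibrate_minmax(activations: np.ndarray) -> tuple[float, int]:
    """derive scale and zero-point from the observed activation range."""
    lo, hi = float(activations.min()), float(activations.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # range must contain 0.0 exactly
    scale = (hi - lo) / 255.0                  # spread the range over 256 codes
    zero_point = int(round(-128 - lo / scale)) # int8 code that represents 0.0
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale
```

the round trip quantize → dequantize loses at most half a quantization step per value when the calibration range actually covers the data, which is the whole game: get the range wrong and you either clip or waste codes.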

the problem is outliers. some layers, particularly in the attention mechanism, have a small number of channels with activation magnitudes 10-100x larger than the rest. if you're using per-tensor quantization (one scale for the whole tensor), that outlier sets the scale, and everything else gets crushed into a tiny fraction of the INT8 range. you lose effective precision on 99% of the channels to accommodate 1%.

the fix is per-channel quantization: each output channel gets its own scale factor. this lets the outlier channels have a wide scale while normal channels keep their precision. the cost is that your kernel needs to handle per-channel dequantization, which adds some complexity to the fused operator. on our hardware it's about a 3% throughput hit, which we earn back many times over in model accuracy.
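the effect is easy to reproduce on a toy tensor (this is an illustration, not our production kernel; shapes and the 50x outlier are made up):

```python
# one outlier channel ruins per-tensor quantization;
# per-channel scales recover precision on the normal channels
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 1, size=(64, 4096)).astype(np.float32)  # 64 channels
x[7] *= 50.0                                              # one outlier channel

def sym_quant_error(x, scale):
    """mean abs error of a symmetric int8 round trip at the given scale(s)."""
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.abs(q * scale - x).mean())

per_tensor_scale = np.abs(x).max() / 127.0                    # one scalar
per_channel_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one per channel

err_tensor = sym_quant_error(x, per_tensor_scale)
err_channel = sym_quant_error(x, per_channel_scale)
```

with the single scalar scale, the 63 normal channels are squeezed into a handful of int8 codes; with per-channel scales the error drops by an order of magnitude or more.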

calibration data matters more than the scheme

i spent a while comparing quantization schemes. symmetric vs asymmetric, per-tensor vs per-channel, different calibration algorithms (MinMax, entropy-based, percentile). the scheme matters, but the calibration data matters more.
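for concreteness, here's the difference between two of those calibration algorithms in sketch form (percentile clips the extreme tail instead of covering it; the 99.99 default is illustrative, not a recommendation):

```python
# MinMax covers every observed value; percentile sacrifices the tail
# to keep resolution for the bulk of the distribution
import numpy as np

def minmax_scale(activations: np.ndarray) -> float:
    """symmetric scale covering the full observed range."""
    return float(np.abs(activations).max()) / 127.0

def percentile_scale(activations: np.ndarray, pct: float = 99.99) -> float:
    """symmetric scale from the pct-th percentile of |activation|."""
    return float(np.percentile(np.abs(activations), pct)) / 127.0
```

on a heavy-tailed activation distribution the two can differ by an order of magnitude, which is exactly why the calibration data (what ends up in that tail) dominates the choice of algorithm.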

we had a model that was calibrated mostly on highway driving data. it performed fine on highway. on residential streets with lots of pedestrians and parked cars, the detections got noticeably worse. the activation distributions differ enough between driving scenarios that a calibration set skewed toward one scenario produces poor quantization for the others.

the fix was obvious in hindsight: calibrate on a diverse dataset covering highway, urban, suburban, parking lots, construction zones. we now track calibration coverage the same way we track training data diversity. it's not glamorous but it caught two regressions that would have shipped otherwise.

dynamic ranges drift with model updates

every time the model gets retrained (which happens regularly), the activation distributions shift. not dramatically, but enough that the old quantization parameters might not be optimal. we re-run calibration after every model update, which sounds obvious, but i've seen setups where the quant config is treated as static.

in one case, a model update shifted the mean activation of one layer by about 15%, pushing some values outside the old INT8 range. the clipping was subtle enough that it didn't show up in aggregate accuracy metrics, but it caused a specific failure mode on left turns where the model underestimated the speed of oncoming traffic. we only caught it during a validation drive. there's a left turn at San Antonio and El Camino that i already don't trust. this made it worse.
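a cheap guard for this is to measure how much of the new model's activations would saturate under the old quant params. a sketch (the threshold is illustrative, not a tuned value, and the function names are mine):

```python
# per-layer saturation check: how much of the new activation
# distribution would clip at the int8 rails under the old params?
import numpy as np

def clip_fraction(activations: np.ndarray, scale: float, zero_point: int) -> float:
    """fraction of values that would saturate at -128 or 127."""
    q = np.round(activations / scale) + zero_point
    return float(np.mean((q < -128) | (q > 127)))

def needs_recalibration(activations, scale, zero_point, tol=1e-4) -> bool:
    return clip_fraction(activations, scale, zero_point) > tol
```

running this per layer on a small held-out batch after every retrain is far cheaper than a validation drive, and it would have flagged the 15% shift before it reached the car.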

the lesson: quantization isn't a one-time step. it's maintenance. every model update needs recalibration and revalidation, and the edge cases that break are rarely the same ones twice.