
what 8ms buys you at 65mph

a car going 65 miles per hour covers about 29 meters every second. in 8 milliseconds (the amount of inference time i saved on our latest optimization pass) the car travels roughly 23 centimeters. about 9 inches.
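the arithmetic is easy to check (the constants here are just unit conversions, nothing tesla-specific):

```python
MPH_TO_MPS = 0.44704  # meters per second in one mile per hour (exact)

speed_mps = 65 * MPH_TO_MPS       # ~29.06 m/s
distance_m = speed_mps * 0.008    # distance covered in 8 ms

print(round(distance_m * 100, 1))    # centimeters → 23.2
print(round(distance_m / 0.0254, 1)) # inches → 9.2
```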

nine inches doesn't sound like much. but those 8ms aren't really about distance. they're about what the rest of the system gets to do with the time you gave back.

the latency budget

the FSD perception-to-action pipeline runs on a fixed time budget per camera frame. our main forward-facing cameras run at 36fps, which gives roughly 27ms per frame. in that window, the neural network takes raw camera images, runs inference, and hands its outputs to the planner, which turns them into driving commands. whatever time the network doesn't use, the downstream planner gets.

before the optimization: inference took about 18ms, leaving 9ms for planning and control.

after: inference takes about 10ms, leaving 17ms.

the planner nearly doubled its time budget. that means more candidate trajectories evaluated, more safety checks, smoother control outputs. the car doesn't arrive sooner. it drives better.
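as a sketch, the budget math looks like this (the 36fps and inference figures are the ones from above; the helper is illustrative, not real pipeline code):

```python
FPS = 36
FRAME_BUDGET_MS = 1000 / FPS  # ~27.8 ms per frame

def planner_budget_ms(inference_ms):
    # whatever inference doesn't use, planning and control get
    return FRAME_BUDGET_MS - inference_ms

before = planner_budget_ms(18)  # ~9.8 ms for the planner
after = planner_budget_ms(10)   # ~17.8 ms for the planner
```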

where the time goes

when people hear "optimization" they think faster algorithms. sometimes. but most of the wins come from reorganizing how data moves through the chip.

Tesla's AI4 is an ASIC designed for neural network inference. it has a matrix-multiply unit (the part that does the math), on-chip SRAM (fast, close to compute, limited capacity), and off-chip DRAM (large, slow to access).

the game is keeping data in SRAM. every round trip to DRAM costs on the order of 100-200 nanoseconds. an operation on data already in SRAM takes maybe 1-2ns. that's roughly 100x.

a naive implementation runs each layer sequentially: load weights from DRAM, load activations, multiply, write results back to DRAM. the next layer loads those same results from DRAM again. data bounces in and out of slow memory between every layer. most of the wall-clock time isn't math. it's waiting.
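a toy latency model makes the imbalance concrete. the 150ns and 1.5ns figures below are midpoints of the ranges above; the operation counts are made up for illustration, and real chips overlap memory and compute in ways this ignores:

```python
DRAM_NS = 150  # midpoint of the 100-200 ns round-trip figure
SRAM_NS = 1.5  # midpoint of the 1-2 ns on-chip figure

def naive_ns(n_ops):
    # every intermediate result bounces through DRAM
    return n_ops * DRAM_NS

def fused_ns(n_ops):
    # one load and one store hit DRAM; everything between stays in SRAM
    return 2 * DRAM_NS + (n_ops - 2) * SRAM_NS

print(naive_ns(100))  # 15000 ns
print(fused_ns(100))  # 447.0 ns
```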

operator fusion

the single biggest technique is operator fusion. instead of running layer norm, then attention, then projection as separate operations, each with DRAM round trips between them, you fuse them into a single operator that streams data through all three while keeping intermediates in SRAM.

this is conceptually simple and in practice a nightmare. each fused operator has to be optimized for the specific hardware. the tiling strategy (how you break the computation into chunks that fit in SRAM) depends on the exact weight matrix dimensions, the SRAM capacity, and the available bandwidth. change the model and you might need new fusion patterns.
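here's a minimal numpy sketch of the idea. the tile size and shapes are hypothetical; on real hardware the tile loop is what keeps each intermediate in SRAM instead of writing it back out:

```python
import numpy as np

SRAM_ROWS = 64  # hypothetical: rows of activations that fit on-chip at once

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fused_norm_project(x, w):
    # stream row tiles through norm + projection; in hardware the
    # normalized tile would live in SRAM and never touch DRAM
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for i in range(0, x.shape[0], SRAM_ROWS):
        tile = x[i:i + SRAM_ROWS]
        out[i:i + SRAM_ROWS] = layer_norm(tile) @ w
    return out
```

tiling by rows works here because layer norm is a per-row operation; change the fusion pattern and the tiling constraints change with it, which is the nightmare part.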

on this pass, the biggest win came from fusing the attention mechanism differently. the standard approach fuses the Q, K, V projections together. we found a way to fuse the QKV projection with the attention score computation for the first heads in each layer, keeping the K and V matrices in SRAM across the operation boundary. that eliminated two DRAM round trips per layer.

across roughly 40 transformer layers, that adds up. 8ms.

quantization

the other major lever is quantization. the model trains in FP16 (16-bit floating point), but we run most inference in INT8 (8-bit integer). INT8 halves the memory bandwidth requirement and the multiply-accumulate units process it faster.

not everything quantizes cleanly. attention logits can have a wide dynamic range that INT8 clips. so we run a mixed-precision scheme: certain operations stay in FP16, the rest quantize to INT8, calibrated per-layer against representative driving data.
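a minimal version of symmetric INT8 quantization, to show the mechanics. this is per-tensor max-abs scaling for illustration; the real scheme is per-layer and calibrated, and none of this is AI4-specific code:

```python
import numpy as np

def int8_quantize(x):
    # symmetric quantization: map max |x| to 127
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

the clipping problem is visible in the first line: one outlier in x inflates the scale, and everything small lands in a handful of integer bins. that's why wide-dynamic-range tensors like attention logits stay in FP16.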

calibration is fussier than people expect. a model quantized with mostly highway data will behave differently on residential streets with lots of pedestrians, because the activation distributions shift. we calibrate on a diverse dataset covering the main driving scenarios: highway, urban, construction, parking lots. the calibration set matters almost as much as the quantization scheme.
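the simplest calibration scheme is max-abs over recorded activations (percentile- and entropy-based methods exist too). a sketch, with a hypothetical dict layout for the recorded data:

```python
import numpy as np

def calibrate_scales(recorded):
    # recorded: layer name -> list of activation arrays captured while
    # running the calibration set (highway, urban, construction, ...)
    scales = {}
    for name, batches in recorded.items():
        peak = max(float(np.abs(b).max()) for b in batches)
        scales[name] = peak / 127.0
    return scales
```

this is exactly where dataset bias shows up: if the recorded batches are all highway frames, the peaks, and therefore the scales, reflect highway activation statistics.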

why it matters

there's a perception that inference work is "just engineering." important but not interesting. the model is the interesting part and we're plumbing.

i get it. but the model doesn't drive a car. the model running on a chip within a latency budget drives a car. every millisecond of inference time is a decision about what to compute, at what precision, with data stored where. those decisions change what the car can do.

8ms at 65mph is 23 centimeters. it's also the planner getting nearly twice the time to keep you safe. i'll take that trade.