
why i chose inference over training

i get asked this a lot. usually phrased as "don't you want to work on the real thing?" which tells you how people think about ML right now.

training is the glamorous side. design architectures, run massive jobs across thousands of GPUs, push scaling curves. the output is a model that can do something it couldn't before. that's the part that gets the papers and the headlines.

inference is what happens after. you take the model and make it actually run. in a datacenter, on a phone, in a car. within a latency budget, a power budget, a cost budget. nobody writes blog posts about shaving 3ms off a forward pass.

well. i guess i do now.

i chose inference because of constraints.

training is an optimization problem with one main objective: make the loss go down. you can almost always improve by throwing more compute, more data, more time at it. Rich Sutton's "The Bitter Lesson" (2019) argued that general methods which leverage computation win out eventually. scaling works.

inference doesn't work like that. you have a fixed piece of silicon. in my case, Tesla's AI4 chip. limited SRAM, fixed memory bandwidth, and a hard deadline: the model needs to produce an output before the next camera frame arrives. you can't add more hardware. it's bolted into the car.
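to make the deadline concrete, here's the back-of-envelope arithmetic. the camera rate below is a number i picked for illustration, not an AI4 spec:

```python
camera_hz = 36  # assumed frame rate for illustration, not a real spec
frame_budget_ms = 1000 / camera_hz
print(round(frame_budget_ms, 1))  # ~27.8 ms for capture + preproc + model + postproc
```

the model only gets a slice of that window. miss it and you're computing on a stale view of the road.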

so the problem becomes: same model, same math, but organized differently. fuse operators to avoid redundant memory traffic. quantize weights to INT8 so they take half the bandwidth of FP16. tile computation to fit in on-chip SRAM instead of spilling to DRAM. schedule operations to overlap data loading with compute.
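to give a feel for the quantization step: here's a toy sketch of symmetric per-tensor INT8 quantization. the function names and the per-tensor scheme are illustrative, not how the AI4 toolchain actually does it:

```python
# toy symmetric INT8 quantization: map float weights onto [-127, 127]
# with a single per-tensor scale. illustrative only.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.02, -1.5, 0.73, 0.0004]
q, scale = quantize_int8(w)
# each weight now occupies 1 byte instead of 2 (FP16): half the bandwidth,
# at the cost of rounding error on small values like the 0.0004
print(q, dequantize(q, scale))
```

the tradeoff is visible even in the toy: the large weight survives exactly, the tiny one rounds to zero. real pipelines use per-channel scales and calibration data to keep that error in check.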

it's compiler work, basically. which makes sense given where i came from.

a well-optimized inference pipeline can run 5-10x faster than a naive implementation of the same model on the same chip. not because the hardware changed. because the data movement changed. you're doing the same operations with less wasted motion.
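you can see where that kind of speedup comes from with a toy accounting of memory traffic for operator fusion. the numbers model a chain like y = relu(a*x + b), assuming every intermediate spills to DRAM when the ops run as separate kernels; they're illustrative, not AI4 measurements:

```python
# toy DRAM traffic model for y = relu(a*x + b) over n elements.
# unfused: op1 reads x and writes t, op2 reads t and writes y.
# fused: one kernel reads x once and writes y once; t stays on-chip.

def traffic_unfused(n, bytes_per_elem=2):  # FP16 = 2 bytes
    return (2 + 2) * n * bytes_per_elem

def traffic_fused(n, bytes_per_elem=2):
    return 2 * n * bytes_per_elem

n = 1_000_000
print(traffic_unfused(n) // traffic_fused(n))  # prints 2: fusing halves traffic here
```

same math, half the bytes moved. stack that saving across a deep network where memory bandwidth, not compute, is the bottleneck, and large end-to-end speedups stop looking mysterious.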

i find that satisfying in a way that's hard to explain. everyone wants to push what models can do. i wanted to work on where they could actually run. a model that needs 8 H100s in a datacenter is a research project. a model that runs in a car on a single chip in 10ms is something people depend on.

and honestly, sitting in a car watching the model make real decisions at highway speed beats watching a loss curve go down.