compiler lessons from Wen-mei Hwu
i spent two years in the IMPACT group at UIUC. the group was led by Professor Wen-mei Hwu, who passed away in June 2023. i've been meaning to write something about what i learned there for a while. this isn't an obituary. there are better ones by people who knew him longer. this is about the ideas.
the first thing Hwu taught me, before any compiler theory, was a number. the ratio of DRAM access latency to arithmetic operation latency on modern hardware. when i joined the group in 2022, it was roughly 200:1 on the GPUs we were targeting. meaning: for every cycle you spend doing useful math, you could spend 200 waiting for data if it's not already close by.
that number changed how i think about everything.
most people who write neural networks think in FLOPs. how many multiply-accumulates does this layer need. but FLOPs are almost never the bottleneck on modern hardware. memory bandwidth is. data has to travel from DRAM to the compute units, and that journey is the expensive part. not the multiplication.
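the FLOPs-vs-bandwidth tradeoff can be made concrete with a rough roofline-style check. everything here is an illustrative assumption, not any specific chip's spec: `peak_flops` and `peak_bw` are made-up round numbers, and the example layer is a plain fp16 matrix-vector product, chosen because it has almost no data reuse.

```python
# a rough roofline-style check: is a layer compute-bound or memory-bound?
# peak_flops and peak_bw are illustrative placeholder numbers.

def bottleneck(flops, bytes_moved, peak_flops=100e12, peak_bw=1e12):
    """compare time spent computing vs time spent moving data."""
    compute_time = flops / peak_flops    # seconds at peak arithmetic rate
    memory_time = bytes_moved / peak_bw  # seconds at peak DRAM bandwidth
    return "compute-bound" if compute_time > memory_time else "memory-bound"

# a 4096x4096 matrix-vector product in fp16: lots of data, little reuse
n = 4096
flops = 2 * n * n                      # one multiply-accumulate per weight
bytes_moved = 2 * n * n                # every fp16 weight read once from DRAM
print(bottleneck(flops, bytes_moved))  # memory-bound
```

on these assumed numbers the math finishes in well under a microsecond but the weights take ~30x longer to arrive, which is the whole point: counting FLOPs tells you almost nothing about where the time goes.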
Hwu's career was basically this insight applied in different ways across decades. in the 90s and 2000s, the IMPACT compiler explored instruction-level parallelism, finding independent operations that could run simultaneously so the hardware stays fed. don't let compute units sit idle waiting for data.
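"keeping the hardware fed" follows directly from the 200:1 figure above. a back-of-envelope sketch, with the latencies as assumed round numbers:

```python
# back-of-envelope: what a 200:1 latency ratio implies for parallelism.
# both latencies are illustrative assumptions matching the figure above.
dram_latency_cycles = 200  # assumed cycles to fetch a value from DRAM
alu_latency_cycles = 1     # assumed cycles per arithmetic operation

# to hide one DRAM fetch, you need this many independent operations
# in flight while the fetch is outstanding:
ops_needed_in_flight = dram_latency_cycles // alu_latency_cycles
print(ops_needed_in_flight)  # 200
```

that's what instruction-level parallelism (and later, massive thread-level parallelism on GPUs) buys you: enough independent work to overlap with the wait.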
when GPU computing took off, he pivoted. he co-authored "Programming Massively Parallel Processors" with David Kirk from NVIDIA. the book is fundamentally about memory hierarchies and data locality, applied to thousands of threads instead of a handful. i read it the summer before i joined the group and it's still the most useful textbook i own.
the thing that stuck with me most wasn't a specific technique. it was a question. whenever someone proposed an optimization, Hwu would ask: "where is the data?" not "how many operations are we doing." where is the data, and how far does it need to travel.
i think about that question every day now.
at Tesla, i optimize neural network inference on custom silicon. the specific chip is different from the GPUs i worked with at UIUC, but the physics is the same. DRAM is slow. on-chip SRAM is fast but small. the entire art of inference optimization is restructuring computation so that data stays close. fusing operators, tiling to fit in SRAM, scheduling to maximize reuse from on-chip buffers.
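the tiling idea can be sketched in a few lines of numpy. this is a minimal illustration, not production code: `TILE` is a hypothetical tile size, and the "SRAM" framing is just the accumulator staying small and local. the payoff is that each tile of the inputs is loaded once and reused across a whole block of output, instead of every element making its own round trip to DRAM.

```python
import numpy as np

# minimal sketch of tiled matrix multiply. TILE is a hypothetical size
# chosen so that three TILE x TILE blocks would fit in on-chip SRAM.
TILE = 64

def tiled_matmul(a, b):
    n = a.shape[0]  # assumes square matrices with n divisible by TILE
    c = np.zeros((n, n), dtype=a.dtype)
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            # the accumulator block plays the role of on-chip SRAM
            acc = np.zeros((TILE, TILE), dtype=np.float32)
            for k in range(0, n, TILE):
                # the only "DRAM" traffic: one tile of a, one tile of b,
                # each reused across TILE rows/columns of output
                acc += a[i:i+TILE, k:k+TILE] @ b[k:k+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```

the computation is identical to the untiled version; only the order of the memory traffic changes. that reordering is most of what inference scheduling is about.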
sometimes i'll be staring at a profile trace, trying to figure out why a particular layer is slower than it should be, and the answer is always the same. the data is too far away.
it's been two years. the IMPACT group continues, but it's different. i think about what he'd make of this moment. every company designing custom AI silicon, inference optimization becoming a real bottleneck at scale. he spent his career building toward this.
i don't know if he'd be impressed or frustrated. probably both. the hardware is extraordinary. the software still wastes most of it.
the lesson i keep coming back to: computation is cheap. data movement is expensive. every optimization i've shipped that actually mattered was about reducing the distance between data and compute. that's his fingerprint on my work and i think it always will be.