Why LLM Inference Is Memory-Bound (Julia Turc)

www.youtube.com

Entertainment · Freemium · Article · 3h ago

About

An educational video breaking down why GPUs vastly underperform their theoretical token throughput during LLM inference. It uses the roofline model to explain memory bandwidth bottlenecks, KV caching, speculative decoding, and how diffusion-based LLMs shift the workload from memory-bound to compute-bound.

Why it made the leaderboard

Most coding-tool advice stops at prompts; this goes a layer down to explain why your GPU leaves most of its FLOPs unused during inference: it is starved for memory bandwidth, not compute. Turc uses the roofline model to make KV caching, speculative decoding, and diffusion LLMs click as engineering responses to the same bottleneck. Watch it if you want to reason about latency, batching, and hardware choices instead of guessing.
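
For a feel of the roofline argument, here is a minimal Python sketch. It is not taken from the video: the hardware numbers (1 PFLOP/s peak compute, 2 TB/s memory bandwidth) are hypothetical, and the intensity estimate assumes 2-byte weights that are each read once per decode step and used for 2 FLOPs per sequence in the batch, ignoring KV cache and activation traffic.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def decode_intensity(batch_size, bytes_per_param=2):
    """Rough FLOPs per byte for one decode step (assumption: weights only).

    Each weight costs bytes_per_param bytes to read and contributes one
    multiply-add (2 FLOPs) per sequence in the batch.
    """
    return 2 * batch_size / bytes_per_param


# Hypothetical accelerator, chosen for round numbers rather than any real GPU.
PEAK_FLOPS = 1.0e15  # FLOP/s
MEM_BW = 2.0e12      # bytes/s

# Intensity at which the memory roof meets the compute roof.
ridge = PEAK_FLOPS / MEM_BW

for batch in (1, 8, 64, 512):
    ai = decode_intensity(batch)
    perf = attainable_flops(PEAK_FLOPS, MEM_BW, ai)
    regime = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  "
          f"utilization={perf / PEAK_FLOPS:7.2%}  {regime}")
```

Under these assumptions a single-sequence decode step sits far to the left of the ridge point, so nearly all of the peak compute goes unused. Raising the work done per byte of weights read, which batching does directly, moves the workload toward the compute roof.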
