Why LLM Inference Is Memory-Bound (Julia Turc)
www.youtube.com
About
An educational video breaking down why GPUs vastly underperform their theoretical token throughput during LLM inference. It uses the roofline model to explain memory bandwidth bottlenecks, KV caching, speculative decoding, and how diffusion-based LLMs shift the workload from memory-bound to compute-bound.
Why it made the leaderboard
Most coding-tool advice stops at prompts; this goes a layer down to explain why your GPU leaves most of its FLOPs unused during inference: it is starved for memory bandwidth, not compute. Turc uses the roofline model to make KV caching, speculative decoding, and diffusion LLMs click as engineering responses to the same bottleneck. Watch it if you want to reason about latency, batching, and hardware choices instead of guessing.
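To make the roofline argument concrete, here is a minimal sketch that is not taken from the video. It assumes a hypothetical accelerator (the peak compute and bandwidth figures are illustrative, not any real GPU's spec) and the common rule of thumb that decoding one token costs about 2 FLOPs per parameter, while the weights are read from memory once per step and shared across the batch. KV-cache and activation traffic are ignored, so the numbers are optimistic.

```python
# Roofline sketch: attainable throughput = min(peak compute, intensity * bandwidth).
# All hardware numbers are illustrative assumptions, not measurements.

PEAK_FLOPS = 1.0e15   # assumed peak compute, FLOPs per second
BANDWIDTH = 2.0e12    # assumed memory bandwidth, bytes per second


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline ceiling for a kernel with the given FLOPs-per-byte intensity."""
    return min(peak_flops, intensity * bandwidth)


def decode_intensity(batch_size, bytes_per_param=2.0):
    """Approximate FLOPs per byte for one decode step.

    Assumes ~2 FLOPs per parameter per token and one read of every weight
    per step, shared by the whole batch (KV-cache traffic ignored).
    """
    return 2.0 * batch_size / bytes_per_param


# Intensity at which the kernel stops being memory-bound.
ridge = PEAK_FLOPS / BANDWIDTH

if __name__ == "__main__":
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    for batch in (1, 8, 64, 512, 1024):
        intensity = decode_intensity(batch)
        ceiling = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
        print(f"batch={batch:5d}  intensity={intensity:7.1f} FLOPs/byte  "
              f"compute ceiling={ceiling / PEAK_FLOPS:.1%} of peak")
```

Under these assumptions a single-request decode step sits far to the left of the ridge point, which is the memory-bound regime the video describes. Larger batches, or checking several draft tokens per weight read as speculative decoding does, move the workload toward the compute roof.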