The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

Here is how the prefill versus generation split exposes GPU structural inefficiencies in AI designs. The post The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking appeared first on EDN.

Recent frontier LLM inference benchmarks have highlighted a recurring pattern. GPU-based systems deliver outstanding throughput when latency is not a concern, but their performance drops sharply once real-time response requirements are imposed.

This behavior is sometimes attributed to software inefficiencies or suboptimal system tuning. In reality, the root cause lies much deeper. It reflects a fundamental mismatch between how GPUs are architected and how autoregressive inference works.

LLM inference: Prefill versus generation

To understand this limitation, it is useful to examine the two distinct phases of LLM inference: prefill and generation.

During the prefill phase, the model processes the entire input prompt in one pass. The prompt is tokenized, embedded, and propagated through every layer of the transformer network. At each layer, the model computes the attention relationships among all tokens and builds the key-value (KV) cache, which stores the intermediate data needed for subsequent token generation.
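To give a sense of how much data the KV cache holds, here is a minimal sketch of a size estimate. It assumes a standard transformer that stores one key vector and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model dimensions (80 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from any particular model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Estimate the KV cache size for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# Hypothetical model dimensions, chosen only for illustration.
prompt_cache = kv_cache_bytes(
    n_tokens=4096, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2
)
print(f"{prompt_cache / 1e9:.2f} GB for a 4,096-token prompt")
```

Under these assumptions the cache grows linearly with the number of tokens, which is why long prompts and long conversations put pressure on accelerator memory.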

This stage maps extremely well onto GPU hardware. GPUs were designed to execute thousands of identical operations in parallel. In the prefill phase, the model performs massive matrix multiplications over large tensors, exactly the type of workload at which GPUs excel. When all tokens are available upfront, the calculations can be distributed across tens of thousands of cores, resulting in very high arithmetic utilization.

The generation phase is fundamentally different.

Once the KV cache has been created, the model begins producing output tokens one at a time. Each token depends on all tokens that came before it. This sequential dependency means that, regardless of how much hardware is available, the model cannot generate the next token until the current one has been completed.

For every generated token, the model must read the parameters for every layer, consult the KV cache, compute the next token probabilities, and then repeat the autoregressive process. The amount of computation per token is relatively modest, but the amount of data movement remains substantial.
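A back-of-the-envelope calculation makes the imbalance concrete. The sketch below assumes a dense transformer in which each weight contributes roughly one multiply and one add per generated token, and in which all weights plus the KV cache are read once per token. The function name `decode_step_cost` and the figures (70 billion parameters, 16-bit weights, a 2 GB KV cache) are hypothetical and serve only as an illustration.

```python
def decode_step_cost(n_params, bytes_per_param, kv_cache_bytes):
    """Rough cost of generating one token for one sequence (batch size 1)."""
    flops = 2 * n_params  # about one multiply and one add per weight
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return flops, bytes_moved


# Hypothetical model: 70 billion parameters, 16-bit weights, 2 GB KV cache.
flops, bytes_moved = decode_step_cost(70e9, 2, 2e9)
intensity = flops / bytes_moved  # FLOPs performed per byte read
print(f"{intensity:.2f} FLOPs per byte")
```

In this example the model performs about one floating-point operation for every byte it reads, a very low ratio for hardware built around dense arithmetic.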

Two faces of GPU architecture: Why modern GPUs struggle with real-time latency constraints

This is where the GPU architecture begins to work against the workload.

GPUs achieve peak efficiency when they execute large, highly parallel workloads with regular memory access patterns. Token generation offers neither. The workload is small, inherently sequential, and dominated by repeated memory accesses rather than dense arithmetic. Many of the GPU’s compute units remain idle while the device waits for data to arrive from high-bandwidth memory.

In other words, generation is not compute-bound; it’s memory-bound.

The distinction is crucial. In a compute-bound workload, adding more arithmetic units improves performance. In a memory-bound workload, performance is limited by how quickly data can be moved to the processors. Once memory bandwidth becomes the bottleneck, additional compute resources provide diminishing returns.
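The roofline model is a common way to express this distinction: attainable performance is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity (FLOPs per byte moved). The sketch below is a minimal version of that bound. The peak and bandwidth values are illustrative round numbers, not the specification of any real device.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


# Illustrative round numbers, not the specification of any real device.
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 3e12   # bytes/s

ridge = PEAK_FLOPS / MEM_BANDWIDTH  # intensity at which the bottleneck switches
for intensity in (1, 100, 1000):
    perf = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>5} FLOPs/byte: {perf / PEAK_FLOPS:6.1%} of peak ({bound})")
```

Below the ridge point, raising `PEAK_FLOPS` leaves the result unchanged; only more bandwidth or higher intensity helps.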

This explains why GPUs can appear extraordinarily efficient when throughput is measured without latency constraints. In that scenario, inference servers are free to buffer requests and combine them into large batches. Batching allows the system to process many token streams simultaneously, effectively transforming numerous small sequential tasks into a larger parallel workload that better matches the GPU’s strengths.
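The effect of batching can be sketched with a simple model: the weights are read once per decode step and shared by every sequence in the batch, while the arithmetic and the KV cache reads grow with the batch size. The function name `batched_intensity` and all figures (70 billion parameters, 16-bit weights, 2 GB of KV cache per sequence) are assumptions chosen for illustration.

```python
def batched_intensity(batch, n_params, bytes_per_param, kv_bytes_per_seq):
    """FLOPs per byte for one decode step shared by `batch` sequences."""
    flops = 2 * n_params * batch                # compute grows with the batch
    weight_bytes = n_params * bytes_per_param   # weights are read once per step
    kv_bytes = kv_bytes_per_seq * batch         # one KV cache per sequence
    return flops / (weight_bytes + kv_bytes)


# Hypothetical model and cache sizes, for illustration only.
intensities = {b: batched_intensity(b, 70e9, 2, 2e9) for b in (1, 8, 64, 256)}
for batch, value in intensities.items():
    print(f"batch {batch:>3}: {value:5.1f} FLOPs per byte")
```

In this toy model, the work done per byte rises steeply with batch size and then flattens, because KV cache reads scale with the batch while weight reads do not.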

The role of batch size in GPU utilization

At first glance, batching in AI inference may appear straightforward. But unlike image inference, where every sample in a batch completes simultaneously, LLM inference involves many conversations progressing independently and asynchronously. Some requests finish quickly, others may continue for hundreds or even thousands of decoding iterations, and new requests may arrive continuously while older conversations are still active.

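The workload therefore becomes highly dynamic and irregular. Specifically, generation for each request ends only when the model produces a special “end-of-sequence” token indicating that the response is complete, or when it reaches a configured output-length limit, so the scheduler cannot know in advance how long any request will run.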

This characteristic fundamentally changes the nature of inference scheduling.

This is where continuous batching becomes essential. Continuous batching is the runtime orchestration algorithm responsible for managing the simultaneous execution of multiple conversations across the same accelerator resources. Instead of treating inference as a sequence of isolated batches, the scheduler continuously inserts, removes, pauses, and resumes requests as tokens are generated.
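The following is a minimal simulation of the idea, not the scheduler of any particular serving framework. It assumes each request needs a fixed number of decode steps (a stand-in for the model emitting end-of-sequence) and that a freed slot can be refilled before the next step. Pausing and resuming requests is omitted. The function names and the `static_batching` baseline are hypothetical.

```python
from collections import deque


def continuous_batching(output_lengths, max_batch):
    """Simulate decode steps with slots refilled as soon as they free up."""
    waiting = deque(enumerate(output_lengths))
    active = {}        # request id -> tokens still to generate
    finished_at = {}   # request id -> step at which the request completed
    step = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(active) < max_batch:
            req_id, length = waiting.popleft()
            active[req_id] = length
        step += 1
        # One decode iteration: every active request produces one token.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:  # stand-in for end-of-sequence
                del active[req_id]
                finished_at[req_id] = step
    return step, finished_at


def static_batching(output_lengths, max_batch):
    """Baseline: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[i:i + max_batch])
    return steps


lengths = [2, 8, 3, 1]  # hypothetical output lengths, in tokens
total_steps, finished_at = continuous_batching(lengths, max_batch=2)
print(total_steps, static_batching(lengths, max_batch=2))
```

With these example lengths, the continuous scheduler finishes in fewer decode steps than the static baseline because short requests release their slots early instead of idling until the longest request in their batch completes.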

The objective is to maximize hardware utilization while minimizing user-visible latency. As batch sizes increase, hardware utilization rises and throughput improves dramatically. However, batching comes at the cost of response time.

When users expect low latency, the system cannot afford to delay requests while waiting to accumulate a large batch. Each request must be processed almost immediately. As batch sizes shrink, the GPU loses the parallelism needed to keep its compute resources busy. Utilization falls, and throughput drops accordingly.
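A simple calculation shows how a latency budget caps the batch size. The sketch assumes evenly spaced arrivals, so accumulating B requests takes (B - 1) divided by the arrival rate. The function name `max_batch_within_budget` and the arrival rate of 20 requests per second are hypothetical.

```python
import math


def max_batch_within_budget(arrival_rate, wait_budget_s):
    """Largest batch that can be accumulated before the first request
    in the batch has waited longer than `wait_budget_s`.

    Assumes evenly spaced arrivals, so collecting B requests takes
    (B - 1) / arrival_rate seconds.
    """
    return math.floor(arrival_rate * wait_budget_s) + 1


# Hypothetical arrival rate of 20 requests per second.
batch_sizes = {
    budget: max_batch_within_budget(arrival_rate=20, wait_budget_s=budget)
    for budget in (0.0, 0.5, 5.0)
}
print(batch_sizes)
```

With no tolerance for waiting, the batch collapses to a single request, which is the regime in which the GPU has the least parallel work available.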

This is the central architectural limitation of GPUs in LLM inference.

The issue becomes even more pronounced when the same accelerator must handle both prefill and generation. Prefill is a large, compute-intensive task, while generation consists of many smaller, latency-sensitive operations. When new prompts arrive, the system may need to interrupt ongoing token generation to perform prompt processing. These context switches, often referred to as preemption, increase latency and reduce efficiency further.

Inference disaggregation: A clever shortcut to mitigate GPU inefficiencies

To mitigate this problem, system designers have begun disaggregating inference. Instead of assigning both phases to the same accelerator pool, they dedicate one group of GPUs to prefill and another to generation. The prefill GPUs build the KV cache and transfer it to the generation GPUs, which decode tokens independently.
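The handoff between the two pools can be sketched as a producer and a consumer connected by a queue. This is a toy model under stated assumptions: the `KVHandoff` type, the `prefill` and `generate` functions, and the placeholder cache entries are all hypothetical, and a real system would move tensors over an interconnect rather than Python objects through a `Queue`.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class KVHandoff:
    """What the prefill pool sends to the generation pool (hypothetical)."""
    request_id: int
    kv_cache: list  # one placeholder entry per token processed so far


def prefill(request_id, prompt_tokens):
    """Prefill pool: process the whole prompt at once and build the cache."""
    return KVHandoff(request_id, [("kv", tok) for tok in prompt_tokens])


def generate(handoff, max_new_tokens):
    """Generation pool: extend the received cache one token at a time."""
    output = []
    for _ in range(max_new_tokens):
        next_token = len(handoff.kv_cache)  # toy stand-in for a forward pass
        output.append(next_token)
        handoff.kv_cache.append(("kv", next_token))
    return output


transfer = Queue()  # stands in for the interconnect between the two pools

# Prefill side: handle incoming prompts without touching any decode work.
for request_id, prompt in enumerate([[11, 12, 13], [21, 22]]):
    transfer.put(prefill(request_id, prompt))

# Generation side: pick up finished caches and decode independently.
results = {}
while not transfer.empty():
    handoff = transfer.get()
    results[handoff.request_id] = generate(handoff, max_new_tokens=3)
print(results)
```

The point of the structure is that neither function ever calls the other: the only coupling between the pools is the cache that crosses the queue.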

This separation eliminates interference between the two phases and allows each group of GPUs to operate more efficiently. Prompt processing can proceed continuously without disrupting active token generation, and generation can continue without interruption.

In controlled benchmark environments, where prompt lengths, output lengths, and request patterns are known in advance, this approach can deliver substantial improvements.

Yet the underlying limitation of GPU architectures remains.

Inference disaggregation: Does it scale in real-world applications?

The generation phase is still sequential and memory bound. No amount of software optimization can eliminate the need to read model weights and cached data for each token. The disaggregated approach simply reduces scheduling inefficiencies and isolates the phases so that GPU resources are used more effectively.

Whether this strategy can scale efficiently in real-world applications depends on workload predictability.

Real-world AI services process a highly variable mix of requests. Some consist of long prompts and short responses. Others involve short prompts and long outputs. Demand can shift rapidly over time, changing the ideal ratio between prefill and generation resources.

Adapting to these changes requires dynamically reallocating accelerators. That process is not instantaneous. Devices must be initialized, model parameters loaded, and serving infrastructure synchronized. If traffic patterns are highly volatile, the overhead of reconfiguration can offset much of the benefit.
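To see how strongly the ideal split depends on the request mix, consider a rough sizing sketch. It assumes a steady request stream and fixed per-GPU token rates for each phase. The function name `gpus_needed`, the capacities, and the two workload mixes are hypothetical values chosen for illustration.

```python
import math


def gpus_needed(request_rate, prompt_len, output_len,
                prefill_tokens_per_s, decode_tokens_per_s):
    """Return (prefill_gpus, decode_gpus) for a steady request stream."""
    prefill_gpus = math.ceil(request_rate * prompt_len / prefill_tokens_per_s)
    decode_gpus = math.ceil(request_rate * output_len / decode_tokens_per_s)
    return prefill_gpus, decode_gpus


# Hypothetical per-GPU capacities, in tokens per second.
CAPACITY = dict(prefill_tokens_per_s=20_000, decode_tokens_per_s=500)

# Same request rate (10 per second), two different workload mixes.
summarization = gpus_needed(10, prompt_len=8000, output_len=100, **CAPACITY)
long_form = gpus_needed(10, prompt_len=200, output_len=1000, **CAPACITY)
print(summarization, long_form)
```

At the same request rate, the first mix is weighted toward prefill and the second toward generation, so a pool sized for one would be badly matched to the other.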

The broader lesson is that GPU performance in LLM inference is governed by more than raw teraFLOPS.

The prefill phase showcases the strengths of GPUs, leveraging dense matrix operations and massive parallelism. The generation phase exposes their weaknesses, forcing highly parallel processors to execute a fundamentally sequential, memory-dominated workload.

As a result, the impressive throughput numbers often reported in unconstrained benchmarks can be misleading. They reflect idealized conditions in which batching hides architectural inefficiencies. Once latency constraints are introduced, those inefficiencies become visible.

The challenge for the industry is not simply to build larger GPUs, but to develop architectures and system designs better aligned with the realities of autoregressive inference.

Until then, the most significant limitation in real-time LLM serving will remain the same: generation is a sequential, memory-bound process running on hardware originally optimized for massively parallel computation.

Lauro Rizzatti is a business development executive with Vsora, a technology company offering semiconductor solutions that redefine design performance. He is a noted chip design verification consultant and industry expert on hardware emulation.

Editor’s Note

In a two-part series, contributor Lauro Rizzatti examines how LLM inference forced changes to MLPerf benchmarking. He will illustrate the evolution of the MLPerf benchmark and detail how generative AI forced a radical shift in AI hardware evaluation in the upcoming Part 2.
