GPUs: A high-throughput architecture confronting a workload shift

Frontier LLMs are evolving away from the dense and homogeneous AI workloads that originally favored GPU architectures. The post GPUs: A high-throughput architecture confronting a workload shift appeared first on EDN.

There is a growing architectural tension at the heart of modern AI infrastructure. The processors that enabled the deep learning revolution—graphics processing units (GPUs)—remain the dominant engines of large-scale training and inference. Yet the computational profile of frontier language models is evolving in ways that increasingly expose the structural assumptions embedded in GPU design.

Memory wall undermining GPU efficiency in LLMs

A profound bottleneck is the memory wall: the growing performance gap in which processors can execute arithmetic operations far faster than memory systems can supply data, leaving increasingly powerful compute units idle while they wait on bandwidth- and latency-limited data movement.

Using the Nvidia H100 as a reference point, modern GPUs deliver multiple petaflops of FP8 tensor throughput and several terabytes per second of high-bandwidth memory access. On paper, arithmetic capacity is immense. In practice, trillion-parameter-class large language models (LLMs) are frequently memory-bound. Arithmetic intensity during inference can fall below 10 FLOPs per byte, which means that performance is limited less by compute units and more by how quickly parameters can be fetched and activations moved.
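
The relationship between arithmetic intensity and achievable throughput can be illustrated with a simple roofline calculation. The sketch below is a minimal illustration, not a model of any specific device: the peak-throughput and bandwidth constants are round-number assumptions chosen only to match the orders of magnitude discussed above, and the function names are hypothetical.

```python
# Roofline sketch: attainable throughput = min(peak compute, bandwidth * intensity).
# The hardware figures below are illustrative assumptions, not measured values.

PEAK_FLOPS = 2.0e15      # assumed peak tensor throughput, FLOP/s
MEM_BANDWIDTH = 3.0e12   # assumed HBM bandwidth, bytes/s


def attainable_flops(intensity, peak=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Upper bound on sustained FLOP/s for a kernel with the given
    arithmetic intensity (FLOPs performed per byte moved)."""
    return min(peak, bandwidth * intensity)


def ridge_point(peak=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Intensity above which a kernel is no longer memory-bound."""
    return peak / bandwidth


if __name__ == "__main__":
    for intensity in (1, 10, 100, 1000):
        bound = attainable_flops(intensity)
        print(f"{intensity:>5} FLOPs/byte -> {bound / 1e12:8.1f} TFLOP/s "
              f"({100 * bound / PEAK_FLOPS:5.1f}% of peak)")
    print(f"ridge point: {ridge_point():.0f} FLOPs/byte")
```

Under these assumed figures, a kernel running at 10 FLOPs per byte is capped at 30 TFLOP/s, or 1.5% of peak, and a kernel needs several hundred FLOPs per byte before the compute units become the limit.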

Energy considerations reinforce this imbalance. A floating-point multiply-accumulate is inexpensive relative to a high-bandwidth memory (HBM) access, and cross-chip communication can cost orders of magnitude more energy than local arithmetic. See Table 1.

Table 1: A comparison of capacity, energy consumption, bandwidth, and latency in a typical memory hierarchy. Source: Author

As model size grows, an increasing share of system energy is spent moving data rather than computing on it. The arithmetic units stall while waiting for weight tensors to arrive, and effective throughput becomes a function of bandwidth and latency rather than raw FLOPS. The challenge compounds when models exceed single-device memory capacity and must be distributed across multiple accelerators.
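
To make the capacity constraint concrete, the following sketch estimates the minimum number of accelerators needed just to hold a model's weights. All inputs are assumptions for illustration: a trillion parameters stored at one byte each (FP8), 80 GB of memory per device, and an optional hypothetical `usable_fraction` to reserve room for activations and caches. It ignores every other memory consumer, so real deployments would need more.

```python
import math


def min_devices(num_params, bytes_per_param, device_mem_bytes, usable_fraction=1.0):
    """Lower bound on device count needed to hold the weights alone.

    usable_fraction is a hypothetical knob for the share of device memory
    left for weights after activations, caches, and runtime overhead.
    """
    weight_bytes = num_params * bytes_per_param
    usable_bytes = device_mem_bytes * usable_fraction
    return math.ceil(weight_bytes / usable_bytes)


if __name__ == "__main__":
    params = 10**12          # assumed trillion-parameter model
    device_mem = 80 * 10**9  # assumed 80 GB per accelerator
    print(min_devices(params, 1, device_mem))        # FP8 weights, all memory usable
    print(min_devices(params, 2, device_mem))        # 16-bit weights
    print(min_devices(params, 1, device_mem, 0.7))   # FP8, 70% of memory for weights
```

Even in the most favorable case here (FP8 weights and all memory available), the weights span 13 devices, so every token's computation involves cross-device communication.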

Frontier LLMs challenging foundations of GPU architecture

The historical success of GPUs in machine learning emerged from an unusually strong alignment between hardware structure and model behavior. Modern GPUs from companies such as Nvidia and AMD are fundamentally throughput-oriented processors built around the single instruction multiple threads (SIMT) execution model.

Groups of threads—warps on Nvidia architectures, wavefronts on AMD architectures—execute instructions in lockstep. Maximum efficiency is achieved when threads follow identical execution paths, access memory in predictable patterns, and sustain dense arithmetic workloads with minimal synchronization overhead.

This design originated in graphics rendering, where millions of pixels or vertices undergo nearly identical operations in parallel. The same architectural assumptions proved highly effective for early deep learning systems, particularly convolutional neural networks and dense transformers. Large matrix multiplications, regular tensor shapes, and high arithmetic intensity mapped naturally onto GPU tensor cores and wide vectorized execution pipelines. Under sufficiently large batch sizes, GPUs can sustain exceptionally high utilization because computation dominates memory latency and control-flow overhead.

Frontier LLMs, however, are evolving away from the dense and homogeneous workloads that originally favored GPU architectures.

Modern LLM systems increasingly incorporate conditional computation: mixture of experts (MoE) layers, dynamic token routing, retrieval augmentation, speculative decoding, adaptive context management, variable sequence lengths, and sparsity-aware attention mechanisms. These techniques improve scaling efficiency at the model level by reducing the amount of computation performed per token while preserving or increasing representational capacity. They also introduce irregularity into execution patterns, precisely the condition under which SIMT architectures become less efficient.

The key issue is not simply “warp divergence” in the narrow classical GPU sense where threads within a warp follow different branches of a control-flow statement. In many MoE implementations, tokens routed to different experts are regrouped before execution specifically to minimize intra-warp divergence.
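
The effect of regrouping can be shown with a toy model. The sketch below assumes one thread per token and treats each distinct expert within a 32-thread warp as a separate serialized pass; real MoE kernels operate on tiles rather than single tokens, so this is an illustration of the idea, not of any particular implementation. The function names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs


def passes_per_warp(expert_ids, warp_size=WARP_SIZE):
    """For each warp-sized slice of tokens, count the distinct experts.

    In this toy model every distinct expert in a warp is a separate
    branch that the warp must execute serially.
    """
    return [len(set(expert_ids[i:i + warp_size]))
            for i in range(0, len(expert_ids), warp_size)]


def regroup_by_expert(expert_ids):
    """Return (permutation, regrouped ids) with tokens sorted by expert.

    The permutation records where each token came from so results can be
    scattered back to the original order afterwards.
    """
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    return order, [expert_ids[i] for i in order]


if __name__ == "__main__":
    # 128 tokens assigned round-robin to 4 experts (assumed routing result).
    ids = [i % 4 for i in range(128)]
    print("before regrouping:", passes_per_warp(ids))
    order, grouped = regroup_by_expert(ids)
    print("after regrouping: ", passes_per_warp(grouped))
```

With interleaved assignments, every warp sees all four experts; after sorting, each warp handles a single expert. The regrouping itself costs a gather before execution and a scatter afterwards, which is data movement rather than arithmetic.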

The deeper architectural tension is broader: SIMT processors are optimized for spatially and temporally coherent workloads, while modern frontier inference increasingly behaves like sparse, dynamically scheduled computation with uneven work distribution and heavy communication dependencies.

In dense transformers, nearly every parameter participates in every token evaluation. Computational intensity remains high, tensor dimensions are regular, and work scheduling is relatively predictable. In sparse MoE systems, by contrast, only a small subset of experts may activate for a given token. A model with 16 experts and top-2 routing, for example, runs each token through only two experts per MoE layer, or one-eighth of the expert parameters if the experts are equally sized. Although this dramatically improves parameter efficiency from a modeling perspective, it also fragments execution into uneven and dynamically changing workloads.

The consequence is reduced effective hardware utilization, not necessarily because every warp is internally diverging, but because the overall system struggles to maintain uniform occupancy, balanced scheduling, and continuous tensor-core saturation. Some experts become overloaded while others sit idle.
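
A small simulation makes the load imbalance visible. The sketch below routes tokens to the top 2 of 16 experts using a deliberately skewed, hypothetical popularity distribution in place of a learned router; the `skew` parameter and the sampling scheme are assumptions for illustration only.

```python
import random
from collections import Counter

NUM_EXPERTS = 16
TOP_K = 2


def route_tokens(num_tokens, num_experts=NUM_EXPERTS, top_k=TOP_K, skew=1.0, seed=0):
    """Assign each token to top_k distinct experts and return per-expert load.

    Stand-in for a learned router: expert e is sampled with weight
    1 / (e + 1) ** skew, so low-numbered experts are more popular.
    """
    rng = random.Random(seed)
    weights = [1.0 / (e + 1) ** skew for e in range(num_experts)]
    load = Counter()
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        for e in chosen:
            load[e] += 1
    return [load[e] for e in range(num_experts)]


def imbalance(load):
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean


if __name__ == "__main__":
    load = route_tokens(1000)
    print("tokens per expert:", load)
    print(f"active experts per token: {TOP_K / NUM_EXPERTS:.3f}")
    print(f"max/mean load: {imbalance(load):.2f}")
```

If the experts run in parallel on separate devices, the step finishes only when the busiest expert does, so a max-to-mean ratio above 1 translates directly into idle time elsewhere.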

Token batches routed to a given expert may be too small to fully utilize matrix engines efficiently. Memory access patterns become less regular. Kernel launch granularity deteriorates. Synchronization overhead increases. The result is that the theoretical arithmetic throughput of the GPU becomes increasingly difficult to translate into sustained application-level throughput.
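
The cost of small per-expert batches can be approximated by tile padding. The sketch below assumes a matrix engine that processes tokens in fixed tiles of 128 rows, a hypothetical figure chosen for illustration; rows beyond the actual token count are padding that consumes cycles without producing useful results.

```python
import math

TILE_ROWS = 128  # assumed tile height; actual tile shapes vary by hardware and kernel


def tile_utilization(tokens, tile_rows=TILE_ROWS):
    """Fraction of tile rows doing useful work for a batch of `tokens` rows."""
    if tokens == 0:
        return 0.0
    padded = math.ceil(tokens / tile_rows) * tile_rows
    return tokens / padded


if __name__ == "__main__":
    for tokens in (8, 40, 128, 129, 1024):
        print(f"{tokens:>5} tokens -> {100 * tile_utilization(tokens):5.1f}% of tile rows used")
```

Under this assumption, an expert that receives 40 tokens uses 31.25% of a tile, and one that receives 129 tokens uses barely half of the two tiles it occupies.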

Furthermore, interactive AI workflows, especially AI agents that respond step by step, are difficult for GPUs to run efficiently. GPUs work best when they can process very large batches of data at once. In LLMs, this usually means combining many user requests together into large matrix operations. Large matrix operations are efficient because they involve much more computation than data movement, keeping the GPU fully occupied.

But interactive systems need low latency: the model must respond immediately instead of waiting to accumulate a large batch of requests, so the batch size stays small. Small batches produce smaller matrix operations, in which the GPU spends more time moving data and less time computing, and utilization drops. The result is a trade-off: large batches yield high GPU efficiency but higher latency, while small batches yield low latency but lower GPU efficiency.
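
The link between batch size and efficiency follows from the arithmetic intensity of a single linear layer. The sketch below is a simplified estimate under stated assumptions: one byte per element (FP8), each operand read or written exactly once, and a hypothetical 8192 x 8192 weight matrix. It ignores caching, attention, and the KV cache.

```python
def gemm_intensity(batch, d_in, d_out, bytes_per_elem=1):
    """FLOPs per byte for multiplying a (batch x d_in) activation matrix
    by a (d_in x d_out) weight matrix, counting each operand once."""
    flops = 2 * batch * d_in * d_out                  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (d_in * d_out      # weights
                                    + batch * d_in    # input activations
                                    + batch * d_out)  # output activations
    return flops / bytes_moved


if __name__ == "__main__":
    d = 8192  # assumed hidden dimension
    for batch in (1, 8, 64, 256, 2048):
        print(f"batch {batch:>5}: {gemm_intensity(batch, d, d):8.1f} FLOPs/byte")
```

At a batch size of 1 the layer performs just under 2 FLOPs per byte, because the entire weight matrix is fetched to process a single token. At a batch size of 256 the same weights are reused across all tokens and the intensity rises to roughly 480 FLOPs per byte, which is the efficiency that low-latency serving gives up.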

Agentic workflows usually prioritize responsiveness, which is why they are harder to run efficiently on GPUs.

The resulting inefficiencies are often obscured by headline FLOP metrics. Modern accelerators advertise enormous peak throughput numbers, but peak throughput reflects idealized dense execution under carefully tuned conditions. Real-world frontier inference frequently operates far from these conditions.

Effective utilization may decline substantially when workloads become routing-heavy, communication-bound, latency-sensitive, or dynamically imbalanced. In practice, the limiting resource increasingly shifts from raw arithmetic capability to orchestration efficiency across memory systems, interconnects, and distributed scheduling layers.

The hidden GPU complexity tax

Alongside these architectural mismatches lies another challenge: the growing software and optimization burden required to extract acceptable performance from GPU systems.

GPUs do not automatically deliver near-peak efficiency. High performance requires extensive manual optimization across multiple abstraction layers. Developers must orchestrate host-device memory transfers, optimize tensor layouts, tune kernel launch parameters, manage register pressure, balance shared-memory usage, fuse operations to reduce synchronization overhead, and carefully align workloads with hardware-specific execution characteristics. Small deviations in tensor dimensions, sequence lengths, routing distributions, or batch composition can materially reduce throughput.

As models become more dynamic, optimization itself becomes more fragile. Kernels tuned for one generation of hardware may perform poorly on another. Code paths optimized for dense transformers may degrade under sparse routing conditions. Performance engineering increasingly depends on vendor-specific toolchains such as CUDA, custom compiler stacks, graph schedulers, and specialized communication libraries tightly coupled to a particular hardware ecosystem.

The cumulative effect is a growing “complexity tax” surrounding GPU-centric AI infrastructure. The cost is not merely electrical power or silicon area, but engineering specialization, portability constraints, software maintenance overhead, and system fragility. As frontier models continue shifting toward sparse, distributed, and conditionally executed architectures, the tension between SIMT-oriented hardware assumptions and emerging AI workloads is becoming increasingly difficult to ignore.

Alternative AI processing architectures are mandatory

These pressures have catalyzed interest in alternative accelerator architectures designed explicitly around transformer workloads and data movement efficiency. Systems such as the TPUs developed by Google emphasize systolic arrays and compiler-driven dataflow scheduling to improve determinism and reduce divergence overhead.

Cerebras Systems has pursued wafer-scale integration, placing tens of gigabytes of SRAM directly on-chip in its wafer-scale engine to minimize off-chip memory traffic and reduce partitioning complexity. Graphcore designed its intelligence processing unit (IPU) around fine-grained parallelism and distributed local memory, explicitly targeting irregular and sparse workloads.

Drawing on more than two decades of architectural expertise and 14 silicon tapeouts, VSORA developed an approach replacing the SIMT computational model with a dataflow architecture specifically engineered to overcome the memory wall. At its core is a massive flat register file spanning several megabytes, designed to supply data directly to large arrays of compute engines organized into wide, deeply pipelined execution paths.

Anticipating the evolving requirements of edge inference and of future AI algorithms, such as those for autonomous driving (AD L3-L5), VSORA also designed and embedded highly programmable processing cores capable of executing an extensive library of DSP operations with low latency and high efficiency.

While each approach involves trade-offs and varying degrees of ecosystem maturity, they share a common premise: future AI workloads are constrained less by arithmetic throughput and more by data orchestration, locality, and communication efficiency.

The next compute frontier

The broader trend in AI systems reflects a shift in the dominant bottleneck. During the convolutional era, compute capacity measured in TFLOPS was the primary metric. Early transformer models balanced compute and memory bandwidth. Frontier LLMs at trillion-parameter scale are now constrained primarily by memory movement and interconnect efficiency. As sparsity and conditional activation become central architectural features, the efficiency of routing and dataflow scheduling begins to outweigh peak arithmetic density.

GPUs remain foundational to AI infrastructure. Their ecosystem maturity, programmability, and unmatched dense training throughput ensure continued relevance, particularly during large-scale pretraining, where arithmetic intensity remains high and workloads are relatively regular. However, as models grow more conditional, more distributed, and more memory-bound, the architectural friction becomes increasingly visible.

The future of AI acceleration will likely reward designs that privilege data locality, minimize cross-device communication, and execute sparse patterns natively rather than emulating them within a dense SIMT framework.

The decisive question for next-generation systems is no longer how many floating-point operations per second can be delivered in isolation. It is how efficiently data can be moved, routed, and scheduled across increasingly complex and sparsely activated models.

Lauro Rizzatti is a business development executive with VSORA, a technology company offering silicon semiconductor solutions that redefine performance. He is a noted chip design verification consultant and industry expert on hardware emulation.
