How data movement defines performance for AI silicon

When data movement is delayed, even the fastest compute engines are left waiting, reducing throughput, increasing latency, and wasting power. The post How data movement defines performance for AI silicon appeared first on EDN.

Regardless of the application, most artificial intelligence (AI) chip designers face the same challenge. Whether the target is cloud data centers, edge devices, automotive platforms, or industrial robotics, optimal performance now depends on how efficiently data is moved.

As AI designs continue to grow in complexity, managing massive data flows through fixed, point-to-point connections no longer scales efficiently. Designers are now dealing with hundreds of compute engines and memory instances, each with different performance requirements, all of which must move data simultaneously.

A network-on-chip (NoC) brings order to chaos by providing a scalable, shared communication infrastructure that moves data where it needs to go with controlled latency and bandwidth. With built-in mechanisms for congestion management, traffic prioritization, and workload isolation, NoCs help teams deliver consistent, predictable performance while staying within tight power, area, and timing budgets.

Different markets, same bottleneck

Whether in hyperscale cloud infrastructure or inside an embedded vision processor, the core problem is data bottlenecks. The end markets differ, but the underlying architectural constraint remains the same. In the cloud, the goal is maximum throughput. Training clusters push bandwidth into the terabytes-per-second range. Massive GPUs and AI accelerators continuously ingest and process vast datasets. In large data center GPUs, more than 80% of dynamic energy is consumed by data transfers to and from DRAM. That energy is not spent on computing. It is spent moving bits.

At the edge, priorities flip. Systems such as autonomous vehicles, robotics, and smart cameras demand microsecond-level latency, strict determinism, and ultra-low power consumption. Edge AI devices may spend up to 90% of inference time waiting on memory I/O.

This is the invisible drain on AI performance.
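To see why the waiting time dominates, consider a rough Amdahl-style bound. The sketch below assumes, for illustration only, that memory wait and compute do not overlap and that the 90% figure quoted above applies to the whole inference. The function name and the speedup values are hypothetical.

```python
def speedup_from_faster_compute(memory_wait_fraction: float,
                                compute_speedup: float) -> float:
    """Amdahl-style bound: only the non-waiting share of time gets faster.

    Assumes memory wait and compute do not overlap (a simplification).
    """
    compute_fraction = 1.0 - memory_wait_fraction
    return 1.0 / (memory_wait_fraction + compute_fraction / compute_speedup)


# With 90% of inference time spent waiting on memory I/O, even a much
# faster compute engine barely changes end-to-end time.
for factor in (2, 10, 1000):
    gain = speedup_from_faster_compute(0.90, factor)
    print(f"compute {factor}x faster -> overall {gain:.3f}x")

# The ceiling as compute speedup grows without bound is 1 / 0.90.
print(f"upper bound: {1 / 0.90:.3f}x")
```

Under these assumptions, the overall gain can never exceed about 1.11x, no matter how fast the compute engine becomes. Reducing the time spent moving data is the only lever left.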

Why NoC architecture matters

The NoC is the backbone that determines how efficiently data flows within a system-on-chip (SoC) or across multiple dies. However, the NoC must be optimized correctly. If not, the entire system slows down, regardless of how powerful the compute cores may be.

AI designs often rely on wide parallel interfaces between IP blocks. As system complexity increases, routing congestion, timing closure issues, and power overhead become more difficult to manage. An NoC addresses these challenges by packetizing traffic. Transactions are broken into packets and routed across a structured fabric, much like off-chip networking. This approach significantly reduces wiring complexity.
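The idea of packetizing a transaction can be illustrated with a minimal sketch. The packet layout here (a 12-byte header carrying destination, address, and length, followed by fixed-size payload flits) is a hypothetical format chosen for clarity. It is not the format of any particular NoC product or protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Flit:
    kind: str       # "head", "body", or "tail"
    payload: bytes


def packetize_write(dest_id: int, address: int, data: bytes,
                    flit_bytes: int = 16) -> List[Flit]:
    """Split one write transaction into a header flit plus payload flits.

    Hypothetical header layout: 2-byte destination, 8-byte address,
    2-byte payload length.
    """
    header = (dest_id.to_bytes(2, "big")
              + address.to_bytes(8, "big")
              + len(data).to_bytes(2, "big"))
    flits = [Flit("head", header)]
    for offset in range(0, len(data), flit_bytes):
        flits.append(Flit("body", data[offset:offset + flit_bytes]))
    if len(flits) > 1:
        flits[-1].kind = "tail"   # mark the last payload flit
    return flits


def depacketize(flits: List[Flit]) -> Tuple[int, int, bytes]:
    """Rebuild (dest_id, address, data) at the receiving interface."""
    header = flits[0].payload
    dest_id = int.from_bytes(header[0:2], "big")
    address = int.from_bytes(header[2:10], "big")
    length = int.from_bytes(header[10:12], "big")
    data = b"".join(f.payload for f in flits[1:])
    assert len(data) == length, "payload length mismatch"
    return dest_id, address, data


flits = packetize_write(dest_id=3, address=0x8000_0000, data=bytes(range(64)))
print(len(flits), [f.kind for f in flits])
```

Because every flit travels over the same narrow link, the address, control, and data fields no longer need dedicated parallel wires between each pair of blocks.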

A wide AXI interface can require hundreds of signals; for example, a given AXI bus interface that requires 280 signals can be reduced to 150 by packetizing transactions. Fewer wires mean less congestion, simpler routing, easier timing closure, reduced silicon area, and lower dynamic power, as shown in the figure below.

Figure: An outline of the advantages of packetized data with NoC IP. Source: Arteris
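As a quick check on the 280-to-150 signal example, the relative saving can be worked out directly. The helper name is hypothetical. The two signal counts are the ones quoted in the text.

```python
def wire_reduction(signals_before: int, signals_after: int) -> float:
    """Fraction of signals removed by packetizing the interface."""
    return (signals_before - signals_after) / signals_before


saving = wire_reduction(280, 150)
print(f"{280 - 150} fewer signals, a {saving:.1%} reduction per interface")
```

That is a reduction of roughly 46% for a single interface, and the saving repeats for every such interface routed across the die.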

Equally important, an NoC decouples IP blocks from transport details. Designers integrate heterogeneous CPUs, GPUs, NPUs, memory controllers, and accelerators without manually wiring hundreds of signals between blocks. The network fabric handles transport abstraction. This level of decoupling does more than simplify integration within a single die. It also lays the groundwork for the next major shift in system design, where functionality is distributed across multiple dies and coordinated at the system level.

From monolithic dies to systems of systems

The separation of IP from transport becomes critical as designs transition to chiplet-based architectures. The shift enables teams to optimize each piece of silicon independently for its specific function and power trade-offs. It also improves yield, lowers costs, and makes it easier to increase compute capacity by adding or reusing chiplets as requirements change.

Within each die, a coherent NoC uses standard protocols such as AMBA CHI or ACE. Non-coherent fabrics connect peripherals and specialized engines into the broader system. Across dies, UCIe enables high-speed die-to-die communication. In advanced multi-package systems, coherent and non-coherent NoCs communicate seamlessly across chiplet boundaries.

The result is effectively a system of systems, with multiple specialized silicon components orchestrated into a unified compute engine. The NoC fabric spans the entire package, coordinating traffic between dies and subsystems.

In this environment, the interconnect is no longer just a supporting block. It shapes the entire system architecture. Every AI system, whether in the cloud or at the edge, has to strike the right balance among three things. Bandwidth must keep GPUs, XPUs, and AI engines fully utilized. Latency must remain low to support real-time inference and control. Efficiency must hold power and thermal budgets within limits as systems expand.
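The bandwidth side of that balance can be approximated with a simple roofline-style estimate, which is a general modeling technique rather than anything specific to this article. The peak throughput, bandwidth, and operations-per-byte numbers below are hypothetical.

```python
def attainable_tops(peak_tops: float, bandwidth_tb_per_s: float,
                    ops_per_byte: float) -> float:
    """Roofline estimate: throughput is capped by compute or by data supply.

    TB/s multiplied by ops/byte gives tera-operations per second.
    """
    return min(peak_tops, bandwidth_tb_per_s * ops_per_byte)


PEAK_TOPS = 100.0        # hypothetical engine peak
BANDWIDTH_TB_S = 1.0     # hypothetical delivered bandwidth

for intensity in (10, 50, 200):   # operations performed per byte moved
    tops = attainable_tops(PEAK_TOPS, BANDWIDTH_TB_S, intensity)
    print(f"{intensity:>3} ops/byte -> {tops:5.1f} TOPS "
          f"({tops / PEAK_TOPS:.0%} utilization)")
```

In this sketch, the engine only reaches full utilization when the workload performs at least 100 operations per byte delivered. Below that point, the interconnect and memory system set the ceiling, not the compute array.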

Designers also need a practical way to grow compute resources without redesigning the interconnect. Modular tiling approaches address that need. Each tile includes its own network interface unit and can be replicated across an NPU array. Need more compute? Add more tiles. The fabric scales without requiring a complete redesign.
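A minimal sketch of the tiling idea follows. It assumes a 2D mesh and dimension-ordered (XY) routing, neither of which is stated in the text. The `Tile` class and the `niu_id` naming are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tile:
    x: int
    y: int

    @property
    def niu_id(self) -> str:
        """Each tile carries its own network interface unit (NIU)."""
        return f"niu_{self.x}_{self.y}"


def build_tile_array(cols: int, rows: int) -> List[Tile]:
    """Replicate the same tile across a cols x rows array."""
    return [Tile(x, y) for y in range(rows) for x in range(cols)]


def xy_route(src: Tile, dst: Tile) -> List[Tuple[int, int]]:
    """Dimension-ordered routing: travel along X first, then along Y."""
    x, y = src.x, src.y
    path = [(x, y)]
    step = 1 if dst.x > x else -1
    while x != dst.x:
        x += step
        path.append((x, y))
    step = 1 if dst.y > y else -1
    while y != dst.y:
        y += step
        path.append((x, y))
    return path


small = build_tile_array(4, 4)
large = build_tile_array(8, 4)    # more compute: same tile, more copies
print(len(small), len(large), large[-1].niu_id)
print(xy_route(Tile(0, 0), Tile(3, 2)))
```

Growing the array changes only the `cols` and `rows` arguments. The tile definition and the routing rule stay the same, which is the property that lets the fabric scale without a redesign.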

Closing the architectural loop

In AI SoCs, designing the NoC requires more than defining the logical topology. Engineers should introduce physical awareness early in the design process. That means using floorplan information, estimated wire distances, and timing constraints. Physical awareness must be built directly into the design flow.

A modern NoC design flow includes:

  1. High-level architectural modeling and simulation
  2. Integration of physical constraints through virtual floor planning
  3. Automatic insertion of pipeline stages with built-in timing analysis
  4. Closed-loop export of constraints to physical synthesis tools

This approach bridges the gap between architectural intent and layout reality. In production designs, physically aware NoC automation has demonstrated the ability to reduce total wire length by roughly 26%, cut maximum latency by half, and improve overall productivity by an order of magnitude. Tasks that once required weeks of manual tuning can now be completed in less than a day.
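The pipeline-insertion step in the flow above can be sketched as a simple estimate: given a floorplan-derived wire length, how many register stages does a link need to meet timing? This is an illustrative first-order model, not the algorithm used by any specific tool. The link names, the clock frequency, and the signal reach per nanosecond are hypothetical values.

```python
import math


def pipeline_stages_needed(wire_length_mm: float, clock_ghz: float,
                           reach_mm_per_ns: float) -> int:
    """Estimate register stages needed so no segment exceeds one clock cycle.

    reach_mm_per_ns is an assumed distance a signal can travel per
    nanosecond on the chosen metal layers (hypothetical, process-dependent).
    """
    period_ns = 1.0 / clock_ghz
    reach_per_cycle_mm = reach_mm_per_ns * period_ns
    segments = math.ceil(wire_length_mm / reach_per_cycle_mm)
    return max(segments - 1, 0)


# Hypothetical link lengths taken from a virtual floorplan, in millimeters.
links_mm = {"npu0_to_ddr": 5.0, "cpu_to_l3": 1.5, "npu1_to_npu2": 4.0}

for name, length in links_mm.items():
    stages = pipeline_stages_needed(length, clock_ghz=1.0, reach_mm_per_ns=2.0)
    print(f"{name}: {length} mm -> {stages} pipeline stage(s)")
```

Running an estimate like this before layout is what makes the flow "physically aware": long links are identified and pipelined early, instead of being discovered as timing violations after place and route.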

Cache hierarchy and data locality

Interconnect optimization must be paired with effective cache architecture. Multi-level cache hierarchies, including L1, L2, and L3, store frequently used data close to the compute engines, reducing memory access latency. Without an effective cache hierarchy, CPU utilization can drop to single-digit percentages.

In some AI SoC regions, last-level non-coherent caches improve data availability without participating in a full coherency protocol. Workloads that do not require tight synchronization, such as certain signal-processing or multimedia tasks, benefit from this approach, which simplifies the design while improving throughput. By increasing data locality, the cache structure reduces reliance on external memory and stabilizes interconnect traffic.
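The effect of a cache hierarchy on latency can be estimated with the standard average memory access time (AMAT) formula. The hit latencies, miss rates, and DRAM latency below are hypothetical round numbers chosen for illustration, not measurements from any design.

```python
from typing import List, Tuple


def average_access_time(levels: List[Tuple[float, float]],
                        memory_latency: float) -> float:
    """Average memory access time for a multi-level cache hierarchy.

    levels: (hit_latency, miss_rate) pairs ordered from L1 outward.
    Each level adds its hit latency plus its miss rate times the cost
    of going one level further out.
    """
    amat = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        amat = hit_latency + miss_rate * amat
    return amat


# Hypothetical latencies in cycles and local miss rates for L1, L2, L3.
hierarchy = [(1, 0.10), (10, 0.30), (40, 0.50)]
dram_latency = 200

print(f"no caches:     {average_access_time([], dram_latency):.1f} cycles")
print(f"L1 + L2 + L3:  {average_access_time(hierarchy, dram_latency):.1f} cycles")
```

With these assumed numbers, the average access drops from 200 cycles to about 6 cycles. Fewer requests reaching external memory also means less traffic crossing the interconnect.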

The reality of AI SoC design

The cost of developing leading-edge SoCs has risen from under $100 million a decade ago to more than $700 million today. As a result, each design iteration or silicon re-spin carries enormous financial risk.

Manual integration processes, fragile scripting, and misaligned hardware-software interfaces amplify that risk. Automated SoC integration flows that validate IP early, maintain consistent specifications across teams, and compile millions of registers in minutes can significantly reduce development time and errors.

Arteris addresses these architectural demands with interconnect IP purpose-built for complex AI platforms where efficient data transport determines overall system behavior. Its FlexNoC and Ncore solutions provide configurable non-coherent and coherent fabrics that support heterogeneous compute clusters and multi-die designs, reducing communication bottlenecks that limit utilization.

By aligning scalable interconnect architecture with disciplined implementation methodology, these interconnect solutions enable design teams to translate system intent into silicon more predictably in an era defined by rising complexity and cost sensitivity.

Automation and physically aware design are no longer optional optimizations. They are survival tools in the AI decade.

Andy Nightingale, VP of product management and marketing at Arteris, has over 39 years of experience in the high-tech industry, including 23 years in various engineering and product management roles at Arm.

 
