Why verification matters in network-on-chip (NoC) design


In the rapidly evolving semiconductor industry, keeping pace with Moore’s Law presents both opportunities and challenges, particularly in system-on-chip (SoC) designs. Notably, transistor counts in the largest devices have soared into the trillions.

As modern applications demand ever greater complexity and functionality, improving transistor utilization without sacrificing energy efficiency has become a key goal. The network-on-chip (NoC) concept was introduced to address the limitations of traditional bus-based systems by enabling efficient, scalable, and flexible on-chip data transmission.

Designing an NoC involves defining requirements, selecting an architecture, choosing a routing algorithm, planning the physical layout, and conducting verification to ensure performance and reliability. Verification is the final checkpoint before an NoC can be deemed ready for deployment: it is where the team establishes that the system is free of deadlock and livelock, building confidence in the design.

In this article, we will dive deeper into a comprehensive methodology for formally verifying an NoC, showcasing the approaches and techniques that ensure our NoC designs are robust, efficient, and ready to meet the challenges of modern computing environments.

Emergence of network-on-chip

NoCs have revolutionized data communications within SoCs by organizing chip components into networks that facilitate the simultaneous transmission of data through multiple paths.

The network consists of various elements, including routers, links, and network interfaces, which facilitate communication between processing elements (PEs) such as CPU cores, memory blocks, and other specialized IP cores. Communication occurs through packet-switched data transmission where data is divided into packets and routed through the network to its destination.

One overview of the complexity of SoC design emphasizes the integration of multiple IP blocks and highlights the need for automated NoC solutions across different SoC categories, from basic to advanced. It advocates using NoCs in SoC designs to effectively achieve optimal data transfer and performance.

At the heart of NoC architecture are several key components:

  1. Links: Bundles of wires that transmit signals.
  2. Switches/routers: Devices routing packets from input to output channels based on a routing algorithm.
  3. Channels: Logical connections facilitating communication between routers or switches.
  4. Nodes: Routers or switches within the network.
  5. Messages and packets: Units of transfer within the network, with messages being divided into multiple packets for transmission.
  6. Flits: Flow control units within the network, dividing packets for efficient routing.
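The message/packet/flit hierarchy above can be sketched in a few lines of Python. The field names, flit size, and payload layout here are illustrative assumptions, not any particular NoC's format:

```python
# Illustrative sketch (not a specific NoC's wire format): a packet is split
# into flits -- a head flit carrying routing info, body flits carrying
# payload, and a tail flit that closes the packet.

from dataclasses import dataclass

@dataclass
class Flit:
    kind: str        # "head", "body", or "tail"
    packet_id: int
    payload: bytes   # routing info for the head flit, data otherwise

def packetize(message: bytes, packet_id: int, dest: tuple,
              flit_bytes: int = 4) -> list:
    """Split one packet's payload into head/body/tail flits."""
    chunks = [message[i:i + flit_bytes] for i in range(0, len(message), flit_bytes)]
    flits = [Flit("head", packet_id, repr(dest).encode())]  # head carries the route
    flits += [Flit("body", packet_id, c) for c in chunks[:-1]]
    flits.append(Flit("tail", packet_id, chunks[-1] if chunks else b""))
    return flits

flits = packetize(b"hello world!", packet_id=7, dest=(2, 3))
print([f.kind for f in flits])   # one head, then body flits, then the tail
```

Downstream routers only inspect the head flit; body and tail flits simply follow the path the head has reserved, which is exactly why a blocked tail flit matters so much in the deadlock case studies later.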

Architectural design and flow control

NoC topology plays a crucial role in optimizing data flow, with Mesh, Ring, Torus, and Butterfly topologies offering various advantages (Figure 1). Flow control mechanisms, such as circuit switching and wormhole flow control, ensure efficient data transmission and minimize congestion and latency.

Figure 1 The topology of an NoC plays an important role in optimizing data flow, as shown with Mesh and Ring (top left and right) and Torus and Butterfly (bottom left and right). Source: Axiomise

Role of routing algorithms in NoC efficiency

As we delve into the complexity of NoC design, one integral aspect that deserves attention is the routing algorithm: the brains behind the NoC that determines how packets move through the complex network from source to destination. Routing algorithms must be efficient, scalable, and versatile enough to adapt to different communication needs and network conditions.

Some of the common routing algorithms for network-on-chip include:

  1. XY routing algorithm: This is a deterministic routing algorithm usually used in grid-structured NoCs. It first routes along the X-axis to the destination column and then along the Y-axis to the destination row. Its advantages are simplicity and predictability, but it cannot adapt to congestion and does not accommodate link failures.
  2. Parity routing algorithm: This algorithm aims to reduce network congestion and increase fault tolerance of the network. It avoids congestion by choosing different paths (based on the parity of the source and destination) in different situations.
  3. Adaptive routing algorithms: These algorithms dynamically change routing decisions based on the current state of the network (for example, link congestion). They are more flexible than XY routing algorithms and can optimize paths based on network conditions, but they are more complex to implement.
  4. Shortest path routing algorithms: These algorithms find the shortest path from the source node to the destination node. They are less commonly used in NoC design because calculating the path in real-time can be costly, but they can also be used for path pre-computation or heuristic adjustment.
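As a concrete illustration of the first algorithm in the list, here is a minimal Python sketch of XY routing on a 2D mesh. The port names and coordinate convention are assumptions for illustration only:

```python
# Minimal sketch of deterministic XY routing on a 2D mesh: route along X
# until the destination column is reached, then along Y.

def xy_next_hop(cur, dst):
    """Return the output port a router at `cur` selects for a packet to `dst`."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                       # X first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                       # then Y
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                     # arrived: eject to the local PE

def xy_path(src, dst):
    """Full hop-by-hop path; useful for checking delivery exhaustively."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path, cur = [src], src
    while (port := xy_next_hop(cur, dst)) != "LOCAL":
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])
        path.append(cur)
    return path

print(xy_path((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the next hop depends only on the current and destination coordinates, the path is fully predictable, which is precisely the property that makes XY routing both easy to verify and unable to route around congestion.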

Advantages of NoCs

  1. Scalability: As chip designs become more complex and incorporate more components, NoCs provide a scalable solution to manage interconnects efficiently. They facilitate the addition of new components without significantly impacting the existing communication infrastructure.
  2. Parallelism: NoCs enable parallel data transfers, which can significantly increase the throughput of the system. Multiple data packets can traverse the network simultaneously along different paths, reducing data congestion and improving performance.
  3. Power consumption: By providing shorter and more direct paths for data transfer, NoCs can reduce the chip’s overall power consumption. Efficient routing and switching mechanisms further contribute to power savings.
  4. Improved performance: The ability to manage data traffic efficiently and minimize bottlenecks through routing algorithms enhances the overall performance of the SoC. NoCs can adapt to the varying bandwidth requirements of different IP blocks, providing optimized data transfer rates.
  5. Quality of service (QoS): NoCs can support QoS features, ensuring that critical data transfers are given priority over less urgent communications. This is crucial for applications requiring high reliability and real-time processing.
  6. Flexibility and customization: The flexibility and customization of the NoC architecture is largely due to its ability to employ a variety of routing algorithms based on specific design requirements and application scenarios.
  7. Choice of routing algorithm: Routing algorithms in an NoC determine the network path of a packet from its source to its destination. The choice of routing algorithm can significantly impact the performance, efficiency, and fault recovery of the network.

NoC verification challenges

Designing an NoC and ensuring it works per specification is a formidable challenge. Power, performance, and area (PPA) optimizations—along with functional safety, security, and deadlock and livelock detection—add a significant chunk of extra verification work to functional verification, which is mostly centred on routing, data transport, data integrity, protocol verification, arbitration, and starvation checking.

Deadlocks and livelocks can force a chip respin, which for modern-day AI/ML chips can cost as much as $25 million. Constrained-random simulation techniques are not adequate for NoC verification; moreover, simulation and emulation cannot provide any guarantees of correctness. Formal methods rooted in proof-centric program reasoning are the only way of ensuring bug absence.

Formal verification to the rescue

Industrial-grade formal verification (FV) relies on formal property verification (FPV) to perform program reasoning, whereby a requirement expressed in the formal syntax of SystemVerilog Assertions (SVA) is checked against the design model via an intelligent state-space search algorithm to conclude whether the intended requirement holds on all reachable states of the design.

The program reasoning effort terminates with either a proof or a disproof, the latter generating counter-example waveforms. No stimulus is written by human engineers; the formal verification technology automatically generates an almost infinite set of stimuli, limited only by the size of the inputs. This ability to verify designs via proof, without any human-driven stimulus, is at the heart of formal verification.

It gives us the ability to pick corner-case issues in the design as well as pick nasty deadlocks and livelocks lurking in the design. Deep interactions in state space are examined quickly, revealing control-intensive issues in the design due to concurrent arbitration and routing traffic in the NoC.
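The exhaustive-exploration idea can be illustrated with a toy model (not a real FPV tool, which works on RTL): a breadth-first search that visits every reachable state of a small two-agent locking model and flags any state with no successor, i.e., a deadlock. The model itself is an assumption chosen for brevity:

```python
# Toy illustration of exhaustive state-space search: enumerate every
# reachable state with BFS and flag any state with no outgoing transition.
# Two agents each need both locks A and B but acquire them in opposite
# orders -- the classic recipe for deadlock.

from collections import deque

def transitions(state):
    # state = (holder_of_A, holder_of_B, progress1, progress2)
    a, b, p1, p2 = state
    nxt = []
    # Agent 1 acquires A then B; agent 2 acquires B then A.
    if p1 == 0 and a is None: nxt.append((1, b, 1, p2))        # 1 takes A
    if p1 == 1 and b is None: nxt.append((a, 1, 2, p2))        # 1 takes B
    if p1 == 2:               nxt.append((None, None, 0, p2))  # 1 releases both
    if p2 == 0 and b is None: nxt.append((a, 2, p1, 1))        # 2 takes B
    if p2 == 1 and a is None: nxt.append((2, b, p1, 2))        # 2 takes A
    if p2 == 2:               nxt.append((None, None, p1, 0))  # 2 releases both
    return nxt

def find_deadlocks(init):
    seen, frontier, deadlocks = {init}, deque([init]), []
    while frontier:                     # BFS visits *every* reachable state
        s = frontier.popleft()
        succs = transitions(s)
        if not succs:
            deadlocks.append(s)
        for t in succs:
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return deadlocks

print(find_deadlocks((None, None, 0, 0)))  # [(1, 2, 1, 1)]: 1 holds A, 2 holds B
```

A constrained-random simulator may or may not stumble into the interleaving that reaches the deadlock state; the exhaustive search is guaranteed to find it, and a clean sweep is a proof of its absence, within the model.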

With NoCs featuring numerous interconnected components operating in tandem, simulating the entire range of possible states and behaviors with constrained-random simulation becomes computationally burdensome and impractical. The intense effort required to drive stimuli deep enough into the NoC to unravel these state-space interactions is simply not feasible, and this limitation undermines the reliability and precision of simulation outcomes.

Integrating FV into the design and verification process using easy-to-understand finite-state-machine notations, and reusing protocol checkers developed for FV in chip and system integration testing, increases confidence and aids error detection and isolation.

This approach is especially effective for complex systems with large state spaces, where traditional system simulation struggles to keep pace.

An NoC formal verification methodology

In the complex process of chip design verification, achieving simplicity and efficiency amid complexity is key. This journey is guided by syntactic and semantic simplification and by innovative abstraction techniques.

In addition to these basic strategies, using invariants and an internal assumption assurance process further accelerates proof times, leveraging microarchitectural insights to bridge the gap between testbench and design under test (DUT). This complex verification dance is refined through case splitting and scenario reduction, breaking down complex interactions into manageable checks to ensure comprehensive coverage without overwhelming the verification process.

Symmetry reduction and structural decomposition address verification challenges arising from the complex behavior of large designs. These methods, along with inference-rule reduction and initial-value abstraction (IVA), provide a path that effectively covers every possible scenario, ensuring that even the most daunting designs can be confidently verified.

Rate flow and hopping techniques provide innovative solutions to manage the flow of messages and the complexity introduced by deep sequential states. Finally, black-box and cut-pointing techniques are employed to simplify the verification environment further, eliminating internal logic not directly subject to scrutiny and focusing verification efforts where they are most needed.

Through these sophisticated techniques, the goal of a thorough and efficient verification process becomes a tangible reality, demonstrating the state-of-the-art of modern chip design and verification methods.

Safeguarding NoCs against deadlocks

When setting up NoCs, it’s important for channels to be independent, but this is not easy to ensure. Dependencies between channels can lead to troublesome deadlocks, where the entire system halts even if just one component fails.
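One classical way to check channel independence is the channel dependency graph test of Dally and Seitz: wormhole routing is deadlock-free if the graph with an edge c1 → c2 whenever a packet holding channel c1 can wait on channel c2 is acyclic. A minimal sketch follows; the 4-node ring and its channel names are illustrative assumptions:

```python
# Sketch of the classical channel-dependency check (Dally & Seitz): routing
# is deadlock-free if the channel dependency graph -- an edge c1 -> c2
# whenever a packet holding c1 can wait for c2 -- is acyclic.

def has_cycle(deps):
    """Detect a cycle in a dependency graph via recursive DFS coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for d in deps.get(c, []):
            if color.get(d, WHITE) == GRAY:            # back edge: cycle
                return True
            if color.get(d, WHITE) == WHITE and dfs(d):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

# Unidirectional 4-node ring, one channel per hop: c0 is 0->1, c1 is 1->2, ...
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
print(has_cycle(ring))     # True: the plain ring can deadlock

# Classic fix: split each physical channel into two virtual channels and
# switch from the "high" class to the "low" class when crossing node 0,
# which breaks the cycle in the dependency graph.
ring_vc = {"h0": ["h1"], "h1": ["h2"], "h2": ["h3"], "h3": ["l0"],
           "l0": ["l1"], "l1": ["l2"], "l2": ["l3"], "l3": []}
print(has_cycle(ring_vc))  # False: acyclic, hence deadlock-free
```

The same acyclicity check scales to real topologies: an FPV tool effectively performs this reasoning over the RTL's actual wait-for relations rather than a hand-built graph.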

Formal verification also contributes to fault tolerance, crucial in NoCs where numerous components communicate. When a component fails, it’s important to understand how close the system is to a permanent deadlock.

Formal verification exhaustively explores all possible system states, offering the best means to ensure fault tolerance. With the right approach, weaknesses of an NoC can be identified and addressed. Catching them early on can save the expensive respin.

Optimizing routing rules for a design’s specific needs is common and critical for performance, but such rules are tricky and hard to test thoroughly in simulation. Hundreds of new test cases may emerge from introducing just one new routing rule.

So, modelling all the optimizations in formal verification is crucial. Done properly, it can quickly catch corner-case bugs or prove that optimizations behave as expected, preventing unexpected issues.
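The flavor of this check can be sketched in Python (a model, not real FPV): enumerate every (source, destination) pair on the mesh and confirm the optimized rule delivers each packet within a hop bound. The deterministic west-first rule below is an illustrative stand-in for whatever optimized routing function a design uses:

```python
# Exhaustive pair checking: rather than hand-picking simulation cases,
# enumerate *every* (source, destination) pair on an n x n mesh and verify
# the routing rule delivers each packet without looping.

def west_first_next(cur, dst):
    """West-first turn model (deterministic variant): all westward hops first."""
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx: return (cx - 1, cy)      # west first, always
    if dx > cx: return (cx + 1, cy)
    if dy > cy: return (cx, cy + 1)
    if dy < cy: return (cx, cy - 1)
    return cur                           # arrived

def check_all_pairs(next_hop, n):
    """Return every (src, dst) pair the rule fails to deliver within 2*n hops."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    failures = []
    for src in nodes:
        for dst in nodes:
            cur, hops = src, 0
            while cur != dst and hops < 2 * n:   # minimal routes need < 2n hops
                cur = next_hop(cur, dst)
                hops += 1
            if cur != dst:
                failures.append((src, dst))
    return failures

print(check_all_pairs(west_first_next, 4))  # []: all 256 pairs deliver
```

Swapping in a buggy rule makes the failure list non-empty immediately, which is the point: one new routing rule is re-checked against every pair at once instead of against a handful of hand-written cases.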

In the next section, we describe at a high level how some bugs can be caught with formal verification.

Formal verification case studies

Message dependence caused deadlock

A bug originated from a flaw in the flow control mechanism where request and response packets shared the same FIFO. In this scenario, when multiple source ports initiate requests, the flow control method leads to a deadlock. For instance, when source port 0 sends a request reqs0, consisting of header flit h0req, body b0req, and tail t0req, the request moves through successfully.

Subsequently, the response resps0, made of (h1resp, b1resp, t1resp) and also intended for source port 0, arrives and causes no issue. However, when a subsequent request reqs2 from source port 2, with header flit h2req, body b2req, and tail t2req, enters the FIFO, only its header and body move forward; the tail t2req is blocked from being sampled into the FIFO because the response header h2resp arrives in the same clock cycle and blocks it.

Consequently, source port 2 is left waiting for its tail t2req, blocked behind the response header, resulting in a deadlock. Meanwhile, source port 1, also waiting for a response, never receives one, further exacerbating the situation. This scenario paralyzed the entire NoC grid, highlighting a critical flaw in the flow control mechanism.

Figure 2 Dependence between request and response causes deadlock. Source: Axiomise

Routing error caused deadlock

In the context of the previously mentioned flow control method, each source port awaits a response after sending a request. However, a deadlock arises due to a flaw in the routing function. When a request is mistakenly routed to an incorrect target port, triggering the assertion of the “wrong_dest” signal, the packet is discarded. Consequently, the source port remains in a state of deadlock, unable to proceed with further requests while awaiting a response that will never arrive.

Figure 3 A routing error discards the misrouted packet, leaving the source port deadlocked while it awaits a response that will never arrive. Source: Axiomise

Redundant logic revealing PPA issues

Certain design choices in the routing algorithm, such as prohibiting specific turns, lead to situations where several FIFOs never have push asserted and some arbiters handle fewer than two requestors.

These issues were identified during the verification process, revealing that the components in question, amounting to millions of gates, go unused in the design yet still occupy chip area and, when clocked, burn power without contributing to performance. Eliminating these superfluous gates significantly reduced manufacturing costs and improved design efficiency.
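A sketch of how such dead logic can be found: enumerate, over all source/destination pairs, which input-to-output turns each router ever exercises under the routing rule (XY here, as an illustrative assumption); any turn that never occurs corresponds to FIFO and arbiter logic that is built but never used:

```python
# Enumerate which (incoming direction -> outgoing direction) turns routers
# actually exercise under XY routing on an n x n mesh. Turns that never
# occur correspond to FIFO/arbiter logic that synthesis still builds --
# area and power spent on unreachable paths.

def xy_next(cur, dst):
    """Deterministic XY next hop on a mesh."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx: return (cx + (1 if dx > cx else -1), cy)
    if dy != cy: return (cx, cy + (1 if dy > cy else -1))
    return cur

def direction(frm, to):
    """Direction of travel between adjacent routers."""
    return {(1, 0): "E", (-1, 0): "W", (0, 1): "N", (0, -1): "S"}[
        (to[0] - frm[0], to[1] - frm[1])]

def used_turns(n):
    """All (in_dir, out_dir) pairs exercised by any packet on an n x n mesh."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    used = set()
    for src in nodes:
        for dst in nodes:
            cur, in_dir = src, "LOCAL"   # packets are injected on the local port
            while cur != dst:
                nxt = xy_next(cur, dst)
                out_dir = direction(cur, nxt)
                used.add((in_dir, out_dir))
                cur, in_dir = nxt, out_dir
    return used

dirs = ["E", "W", "N", "S"]
used = used_turns(4)
never = sorted((i, o) for i in dirs for o in dirs if i != o and (i, o) not in used)
print(never)  # every U-turn and every Y-to-X turn: logic for these is dead
```

Under XY routing a packet travelling north or south never turns back onto the X axis, so, for example, the north-input-to-east-output path of every router, and the arbitration it implies, is provably unreachable and can be stripped.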

The case for formal verification in NoC

An NoC-based fabric is essential for any modern high-performance computing or AI/ML machine. NoCs enhance performance by efficient routing to avoid congestion. While NoCs are designed to be efficient at data transmission via routing, they often encounter deadlocks and livelocks in addition to the usual functional correctness challenges between source and destination nodes.

With a range of topologies possible for routing, directing simulation sequences to cover all possible source/destination pairs is almost impossible for dynamic simulation. Detecting deadlocks, starvation, and livelocks is nearly impossible for any simulation- or even emulation-based verification.

Formal methods drive an almost infinite amount of stimulus to cover all necessary pairs in any topology. With the power of exhaustive proofs, formal verification can establish conclusively that there is no deadlock, livelock, or starvation.

Editor’s Note: Axiomise published a whitepaper in 2022, summarizing a range of practically efficient formal verification techniques used for verifying high-performance NoCs.

Zifei Huang is a formal verification engineer at Axiomise, focusing on NoC and RISC-V architectures.

Adeel Liaquat is an engineering manager at Axiomise, specializing in formal verification methodologies.

Ashish Darbari is founder and CEO of Axiomise, a company offering training, consulting, services, and verification IP to various semiconductor firms.

The post Why verification matters in network-on-chip (NoC) design appeared first on EDN.
