As we established in our previous dive into SONiC-DASH, the project is not just a "NIC-based flavor" of SONiC. It is an architectural extension designed to offload Stateful Overlay Services to the hardware edge.
While standard SONiC excels at managing the data center underlay (L3 routing, BGP, ECMP), DASH introduces high-scale primitives like ENIs (Elastic Network Interfaces) and VNET Peering. The technical challenge here is the scale: we are no longer just looking for a route in a routing table; we are managing millions of flows with specific performance targets for CPS (Connections Per Second) and PPS (Packets Per Second).
However, building these capabilities is only half the battle; validating them is another major part of DASH development. This article explores the evolving landscape of DASH validation, targeting network engineers and QA architects tasked with bringing these high-performance systems to production. By reading on, you will gain a clear understanding of the DASH Test Maturity Model, learn how to navigate the choice between legacy tools and modern frameworks, and discover the data-driven strategies required to verify cloud-scale performance.
Why DASH Demands a New Testing Approach
For years, the sonic-mgmt and SAI PTF test frameworks have been the gold standard for validating the SAI/SONiC ecosystem. They are often used collaboratively, but they serve distinct roles in the development lifecycle:
- sonic-mgmt: This is the comprehensive, Ansible-based repository used for system-level, end-to-end testing of the entire SONiC NOS. It is the primary tool for validating how a switch behaves in a full Clos topology (T0, T1, T2) and for ensuring that features like BGP and ECMP work across the entire stack.
- SAI PTF (Packet Test Framework): A specialized, targeted framework for validating the Switch Abstraction Interface (SAI) API. It focuses on dataplane packet testing and is essential when porting a new ASIC, debugging hardware forwarding issues, or validating SAI conformance.
For classic networking infrastructure where the goal is stable, high-capacity routing, these tools are unparalleled. However, as DASH extends SONiC beyond standard boundaries into the edge and customized networking paths, it introduces a category of "stateful" complexity that challenges these traditional models.
DASH is designed to handle millions of flows and massive SDN configurations. This shift creates a "validation gap" that traditional tools weren't originally built to bridge:
- The Need for Data-Driven Testing: While testing with SAI PTF is possible, it lacks a native Data-Driven approach. In a DASH environment, we aren't just testing a few dozen ACLs; we are managing millions of ENIs and mapping entries. Hardcoding these into traditional PTF scripts is neither scalable nor maintainable.
- Lack of Hardware Traffic Generator Integration: Traditional frameworks rely on software-based packet generation (like Scapy), which cannot reach the line rates or simulate the high-load stateful traffic required for DASH.
- Infrastructure vs. Service: While sonic-mgmt is indispensable for the final stages of full NOS integration, it was originally created for classic switches and routers. It excels at proving a box is a good "network citizen," but DASH requires proving the box can act as a high-performance cloud-service appliance.
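To make the data-driven gap concrete, here is a minimal sketch (plain Python; `make_eni_entries`, `eni_spec`, and the record fields are all illustrative, not a real DASH API) of the difference between hardcoding entries and generating them from a compact, declarative spec:

```python
# Minimal sketch of data-driven scale-out: instead of hardcoding millions of
# entries in a test script, derive them lazily from a compact declarative spec.
# All names (eni_spec, make_eni_entries) are illustrative, not a real DASH API.

def make_eni_entries(spec):
    """Expand a small spec into per-ENI config records on demand."""
    for i in range(spec["eni_count"]):
        yield {
            "eni_id": spec["eni_base"] + i,
            "vni": spec["vni_base"] + i,
            "underlay_ip": f"10.{(i >> 8) & 255}.{i & 255}.1",
        }

eni_spec = {"eni_count": 1_000_000, "eni_base": 1, "vni_base": 5000}

# The generator yields entries lazily, so a million-entry configuration never
# has to exist in memory all at once.
first = next(make_eni_entries(eni_spec))
print(first)  # {'eni_id': 1, 'vni': 5000, 'underlay_ip': '10.0.0.1'}
```

Parameterizing tests this way means the same script covers ten entries in CI and a million entries in a scale run, with only the spec changing.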
This shift from traditional infrastructure to the specialized cloud services mentioned in our previous article creates a new, multi-dimensional set of QA requirements:
- Overlay Policy & VNET Validation: Unlike standard L3 routing, we must validate the logic of complex overlays (VNET-to-VNET, VNET Peering). This requires verifying that packets are correctly encapsulated/decapsulated and that LPM (Longest Prefix Match) lookups work across millions of virtual network mappings.
- Stateful Security & Connection Tracking: For services like the Load Balancer and Stateful ACLs, the hardware must maintain the "state" of millions of active TCP connections. QA must ensure that state updates happen at line rate and that no sessions are dropped during high-load transitions.
- High Availability (HA) & Failover Reliability: In an HA scenario, the standby DPU must be perfectly synchronized with the active unit. Testing here focuses on "zero-drop" failover—ensuring that the backup unit can seamlessly assume all network responsibilities without breaking existing sessions.
- Tunneling, Private Link & Encryption: For Encryption Gateways and Service Tunnels, we move into the realm of cryptographic verification. This requires validating that packet transformations (encryption/decryption) happen with sub-microsecond latency and absolute data integrity.
- Cloud-Scale Performance (The "Hero" Metrics): Finally, all the above must be validated at a scale that breaks traditional tools. We are looking for Connections Per Second (CPS) and Packets Per Second (PPS) targets that are 10x to 100x higher than software-based stacks can achieve.
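To illustrate the overlay-mapping lookup described above, here is a stdlib-only sketch of LPM semantics (a toy linear scan, not the hardware algorithm; the prefixes and action names are made up):

```python
import ipaddress

# Toy LPM table: maps overlay prefixes to VNET mapping actions.
# A real xPU uses specialized trie/TCAM hardware; this linear scan only
# illustrates the semantics that QA has to verify at scale.
routes = {
    ipaddress.ip_network("10.1.0.0/16"): "vnet-a-encap",
    ipaddress.ip_network("10.1.2.0/24"): "vnet-b-encap",  # more specific wins
    ipaddress.ip_network("0.0.0.0/0"): "drop",
}

def lpm_lookup(dst):
    """Return the action of the longest matching prefix for dst."""
    addr = ipaddress.ip_address(dst)
    best = max(
        (net for net in routes if addr in net),
        key=lambda net: net.prefixlen,
    )
    return routes[best]

print(lpm_lookup("10.1.2.7"))  # vnet-b-encap (the /24 beats the /16)
print(lpm_lookup("10.1.9.9"))  # vnet-a-encap
```

A DASH validation suite must confirm exactly this "most specific prefix wins" behavior, except across millions of virtual network mappings rather than three.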
Ultimately, weighing the collaborative strengths of sonic-mgmt and SAI PTF against their architectural limitations in the face of SONiC-DASH's high-scale challenges, it becomes clear that a new validation paradigm is required.
To bridge the gap between core infrastructure and edge services, we need an approach that enables convenient, reusable, and truly data-driven test cases - designed from the ground up to support hardware traffic generators (HW TG) and seamless CI integration. Before we dive into a detailed comparison of the specific tools that solve these problems, let’s first look at the community-defined DASH Test Maturity Stages to establish a common vocabulary and understand the incremental workflow required for modern xPU validation.
The Community Roadmap: DASH Test Maturity Stages
Navigating the transition from traditional networking to DASH requires a high-resolution view of the testing lifecycle. The community defines two parallel paths - Data Plane Testing and Stack Testing - each consisting of five distinct stages designed to establish a "common vocabulary and workflow" for all project participants.
Bottom-Up: Data Plane Testing Stages
This path focuses on the hardware's packet processing logic. By progressively adding layers of the SONiC stack, teams can isolate whether a performance bottleneck lies in the silicon or the software orchestration.
- Stage 1: Vendor Proprietary (Proprietary Config & Traffic). Testing begins with vendor-specific APIs (gRPC/REST) and manual traffic generation to prove initial hardware functionality.
- Stage 2: Standardized, Automated Test Cases. The transition to standardized, data-driven test suites and traffic generation, e.g., OpenTrafficGenerator and snappi.
- Stage 3: SAI-Thrift Integration. A crucial milestone that verifies the ability to integrate with the SONiC stack via the SAI-Thrift interface. This allows for a direct comparison of DUT behavior with and without the NOS.
- Stage 4: SAI-Redis Integration. Testing moves "down the stack" to integrate with the Redis ASIC_DB and the syncd daemon, replicating the internal SONiC communication flow.
- Stage 5: Full Northbound API. The culmination of data plane integration, where the DUT is controlled via the final management endpoint, such as gNMI.
Top-Down: SONiC-DASH Stack Testing Stages
In parallel, stack testing ensures that the control plane - from the SDN controller down to the ASIC - is ready to manage complex overlay services.
- Stage 1: Dummy Northbound API. Focuses on the management API in isolation, mapping northbound objects to the Redis APPL_DB.
- Stage 2: Redis Logic. Validates the data schema and CRUD (Create, Read, Update, Delete) access within the APPL_DB.
- Stage 3: Functional Orchestration. The introduction of a functional orchd (orchestration daemon) to handle DASH-specific application objects and enhance the ASIC_DB.
- Stage 4: Syncd & Fake SAI. Translating ASIC_DB objects into SAI calls using a functional syncd linked to a "fake" libsai library to test logic without hardware.
- Stage 5: Full Stack Convergence. The convergence of the top-down and bottom-up paths represents a fully integrated, production-ready system.
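A Stage 2-style CRUD check can be sketched with a plain dict standing in for the Redis APPL_DB (the key format mimics SONiC's `TABLE:key` convention; the table name and field names here are illustrative):

```python
# Sketch of Stage 2 "Redis Logic" validation: exercise CRUD on an
# APPL_DB-style table. A dict stands in for Redis; keys follow SONiC's
# "TABLE:key" convention. Table and field names are illustrative only.

appl_db = {}

def set_entry(table, key, fields):   # Create / Update
    appl_db[f"{table}:{key}"] = dict(fields)

def get_entry(table, key):           # Read
    return appl_db.get(f"{table}:{key}")

def del_entry(table, key):           # Delete
    appl_db.pop(f"{table}:{key}", None)

set_entry("DASH_ENI_TABLE", "eni-1", {"vni": "5000", "mac": "aa:bb:cc:dd:ee:01"})
assert get_entry("DASH_ENI_TABLE", "eni-1")["vni"] == "5000"   # Create + Read

set_entry("DASH_ENI_TABLE", "eni-1", {"vni": "5001", "mac": "aa:bb:cc:dd:ee:01"})
assert get_entry("DASH_ENI_TABLE", "eni-1")["vni"] == "5001"   # Update

del_entry("DASH_ENI_TABLE", "eni-1")
assert get_entry("DASH_ENI_TABLE", "eni-1") is None            # Delete
```

In a real Stage 2 run the same assertions would go through a Redis client against the live APPL_DB, validating both the schema and the access patterns before any orchestration logic exists.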
Choosing the right tool for SONiC-DASH validation is not about finding a single "perfect" framework, but about selecting the right methodology for the current maturity stage of your implementation. While traditional SONiC testing remains essential for core infrastructure, the unique requirements of the network edge demand frameworks that can handle massive scale and declarative configurations.
High-Level Approaches – Choosing the Right Tooling
To effectively navigate the DASH testing landscape, engineering teams typically choose between three primary architectural approaches. Each has its own strengths, depending on whether the goal is functional logic, infrastructure stability, or high-performance hardware validation.
1. SAI-PTF (Functional Foundation)
SAI PTF is a specialized, targeted framework for validating the Switch Abstraction Interface (SAI) API. It focuses heavily on dataplane packet testing by using a Python-based client-server model over SAI-Thrift, allowing test scripts to invoke SAI API calls as remote procedure calls (RPC).
- Best For: Low-level functional unit testing and Data Plane Maturity Stages 1-3.
- Pros: Highly standardized with a deep community history; excellent for "packet-at-a-time" logic checks and raw SAI attribute validation.
- Cons: Hardcoding configurations for millions of flows is neither scalable nor maintainable; software-based traffic generation (Scapy) cannot reach xPU line rates.
When to use it: You should reach for SAI PTF when your primary goal is ASIC onboarding or SAI conformance. If you are porting a new silicon to the DASH ecosystem and need to debug hardware forwarding issues at a granular level, PTF is the right tool. It allows you to isolate the SAI library from the rest of the NOS, ensuring that the "glue" between the software and the silicon is functionally perfect before adding more complexity.
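The "packet-at-a-time" granularity of this style can be illustrated with a stdlib-only sketch: build a single IPv4 header, compute its checksum, and assert on the one expected packet (in real SAI PTF, Scapy builds the packet and SAI-Thrift programs the DUT; both are omitted here):

```python
import struct

# Stdlib-only illustration of "packet-at-a-time" validation: craft one IPv4
# header, compute the standard ones'-complement checksum, and verify it.
# In real SAI PTF, Scapy handles packet crafting and SAI-Thrift drives the DUT.

def ipv4_checksum(header: bytes) -> int:
    """RFC 791 ones'-complement sum over 16-bit words."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

def build_ipv4_header(src: bytes, dst: bytes) -> bytes:
    # version/IHL, TOS, total length, ID, flags/frag, TTL, proto=TCP, csum=0
    hdr = struct.pack("!BBHHHBBH", 0x45, 0, 20, 0, 0, 64, 6, 0) + src + dst
    csum = ipv4_checksum(hdr)
    return hdr[:10] + struct.pack("!H", csum) + hdr[12:]

pkt = build_ipv4_header(bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
# A receiver recomputing the checksum over the full header must get 0.
assert ipv4_checksum(pkt) == 0
```

This level of per-field, per-packet assertion is exactly what PTF excels at, and exactly what does not scale to millions of stateful flows.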
2. sonic-mgmt (Infrastructure & Topology)
sonic-mgmt is the comprehensive, Ansible-based test framework used for system-level, end-to-end testing of the entire SONiC NOS. It is the industry standard for orchestrating complex network fabrics and verifying that the Device Under Test (DUT) behaves correctly within a multi-node environment. While it is the industry standard for orchestrating complex Clos topologies, it has been updated with a dedicated DASH test suite to validate the integration of DASH services into the broader SONiC ecosystem.
- Best For: Stage 5 (Full Stack) validation. It is the primary tool for testing the Northbound gNMI path and ensuring correct SONiC-DASH full-stack cooperation (orchd, syncd, Redis, gNMI), including basic ("packet-at-a-time") Data Plane validation.
- Pros: Validates the entire "intent-to-asic" pipeline; uses standard Pytest-ansible patterns familiar to SONiC engineers; excellent for functional regression of the management plane.
- Cons: Not optimized for performance or hyper-scale validation; traffic generation is typically software-based (Scapy/PTF), which lacks the density needed for hardware-rate stress tests.
When to use it: You should use sonic-mgmt when your focus shifts from "does the silicon work?" to "does the system work?". If you have developed a new DASH feature, such as a specific VNET-peering logic, you could use sonic-mgmt to prove that the DUT can be configured via gNMI, that the applied configuration forwards traffic correctly, and that the system maintains stability throughout the process.
3. SAI Challenger with OTG (Modern & Scalable)
This framework is specifically enhanced to address the unique challenges of DASH by introducing a Data-Driven approach that decouples test logic from the underlying RPC mechanism (Thrift or Redis). It integrates with the Open Traffic Generator (OTG) via the snappi SDK, providing a unified path from initial software simulation to production-grade hardware validation.
- Best For: The entire DASH lifecycle, from early-stage functional testing in CI pipelines to "Hero" performance validation on real hardware.
- Pros:
  - Reusability: The same test script can run against a software simulator (bmv2) using virtual traffic generators like Ixia-c and then move seamlessly to physical xPUs and hardware traffic generators without code changes.
  - Declarative Configuration: Supports massive table entries (millions of flows) via dpugen, which improves maintainability and reduces the labor of manual API calls.
  - Umbrella Support: Acts as a PTF Test Runner, allowing teams to execute legacy SAI PTF test cases natively with zero changes, ensuring previous investments in testing are not lost.
- Cons: As a newer ecosystem, it has a steeper learning curve than legacy PTF scripts. Additionally, while it excels at SAI-level validation, it does not currently support the Northbound gNMI management interface.
When to use it: SAI Challenger is the definitive choice for teams adopting a "shift-left" strategy, as it is the only framework that provides a unified workflow across all DASH Maturity Stages (with only the limitation of the Northbound gNMI management interface support). Because the framework is target-agnostic and supports legacy SAI-PTF test cases as an umbrella runner, it eliminates the need to rewrite scripts as the implementation matures. This ensures a seamless transition from behavioral simulation (bmv2) to high-speed hardware "Hero Test", making it a critical tool for those needing to prove cloud-scale connection performance (CPS/PPS) on real hardware.
Comparison of DASH Testing Approaches
| Feature | SAI-PTF | sonic-mgmt | SAI Challenger + OTG |
|---|---|---|---|
| Primary Target | Functional SAI API and logic validation. | Full-stack system integration and fabric-level stability. | Functional SAI API and logic validation, plus cloud-scale performance and end-to-end hardware validation. |
| Configuration API | SAI-Thrift (RPC). | gNMI, Management interfaces, or Config DB. | SAI-Thrift or Redis. |
| Traffic Engine | Software-based PTF/Scapy (packet-at-a-time). | Software-based PTF/Scapy (packet-at-a-time). | OTG / snappi SDK (agnostic to SW and HW traffic generators); legacy PTF/Scapy also supported. |
| Philosophy | Imperative: Manual, step-by-step API scripts. | Integration-Centric: Proving the device is a stable network node. | Declarative: Data-driven automation focused on scalability. |
| Maturity Stage | Primarily Data Plane Stages 1–3. | Primarily Full Stack Stage 5. | Unified path covering Data Plane & Stack Stages 1–5. |
| Scalability | Low: Limited by manual coding and software traffic generation. | Medium: Optimized for standard, fixed-topology data center tests. | High: Built to handle millions of flows using data-driven tools. |
The "Hero Test": The Gold Standard of Performance
Any discussion of DASH tooling eventually leads to the "Hero Test". This is the ultimate validation benchmark for high-performance xPUs, designed to push silicon to its absolute limits. It represents Stage 5 of maturity, where every component (silicon, drivers, and the DASH pipeline) must perform in perfect harmony.
A typical Hero Test involves:
- Performance-at-Scale: Validating 800G+ line rates and massive Connections Per Second (CPS) targets.
- Stateful Depth: Maintaining over 120 million active background flows with zero packet loss.
- Real-World Stress: Using OTG-compliant hardware testers to simulate thousands of concurrent SDN policies while monitoring hardware-level metrics like latency and throughput.
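A quick back-of-the-envelope calculation shows why the flow-setup rate matters as much as the flow-table depth. The CPS figures below are illustrative assumptions, not official DASH targets:

```python
# Illustrative Hero-Test arithmetic (the CPS numbers are assumptions, not
# DASH targets): even installing the background flow table is a rate problem.
background_flows = 120_000_000   # stateful flows the DUT must hold
hw_cps = 5_000_000               # assumed hardware-generator setup rate

ramp_seconds = background_flows / hw_cps
print(f"Hardware-rate ramp: {ramp_seconds:.0f} s")  # 24 s

# At a software-generator rate (say 50k CPS) the same ramp takes 40 minutes,
# which is why hardware traffic generators are mandatory at this stage.
sw_cps = 50_000
print(f"Software-rate ramp: {background_flows / sw_cps / 60:.0f} min")  # 40 min
```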
Achieving these "Hero" metrics requires more than just high-speed traffic; it requires a configuration engine capable of generating millions of table entries without crashing the test server. This is where dpugen becomes essential.
Standard static JSON files for a Hero Test exceed 1.5 GB in size, which is impractical to load into memory as a single document. dpugen solves this by acting as an iterator: it streams SAI records on the fly, so the framework holds only a tiny amount of data in memory at any given time while configuring the Device Under Test (DUT) with millions of rules.
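The iterator pattern can be sketched in pure Python. This shows the idea behind dpugen, not its actual API; the record fields and object-type string are illustrative:

```python
import itertools

# Sketch of the streaming-iterator idea behind dpugen: yield SAI-style
# records one at a time instead of materializing a multi-gigabyte JSON.
# Record fields are illustrative, not dpugen's actual output schema.

def stream_outbound_mappings(count):
    for i in range(count):
        yield {
            "op": "create",
            "type": "OUTBOUND_CA_TO_PA_ENTRY",
            "key": f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}",
            "underlay_dip": "100.64.0.1",
        }

def apply_in_batches(records, batch_size, apply_fn):
    """Consume the stream in fixed-size batches; memory stays ~batch_size."""
    applied = 0
    while batch := list(itertools.islice(records, batch_size)):
        apply_fn(batch)  # would push one bulk SAI call to the DUT
        applied += len(batch)
    return applied

total = apply_in_batches(stream_outbound_mappings(1_000_000), 10_000, lambda b: None)
print(total)  # 1000000
```

Because only one batch ever exists in memory, the same pattern scales from a unit test with ten records to a Hero Test with millions.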
The Hero Test is more than just a performance check; it is a complex intersection of high-speed hardware engineering, P4 pipeline optimization, and cloud-scale orchestration. Because of its depth and the practical expertise required to execute it, ranging from fine-tuning ARM cores to managing 800G testbeds, this topic truly deserves its own dedicated deep-dive article.
Summary
The transition of SONiC from core data center infrastructure to the high-performance network edge via DASH represents more than a change in hardware; it is a shift in the entire operational and validation paradigm. The stateful nature of DASH services and the massive scale of SDN policies demand a move toward target-agnostic, data-driven testing methodologies.
Looking Ahead: The Future of DASH
As we concluded in our initial look at SONiC-DASH, this technology is poised to follow the same successful trajectory as standard SONiC. SONiC has already earned the industry’s trust as the definitive open-source NOS for data center switches, and we can expect a similar evolution for DASH. The heavy investment from major industry leaders, including Microsoft, NVIDIA, Marvell, and Broadcom, underscores a collective commitment to making programmable xPU infrastructure a reality.
With this level of commercial interest, we can expect a rapid evolution of validation frameworks. The SAI Challenger + OTG solution already offers a glimpse into this future, providing a seamless "shift-left" workflow that allows for early SAI testing on software models (bmv2) and a smooth transition to real hardware for final, line-rate "Hero Tests." At the same time, we can be certain that established tools like SAI PTF and sonic-mgmt will continue to improve and evolve to cover DASH-specific requirements.
Ultimately, your choice of methodology should be driven by your specific goals: on which stage of DASH validation and development are you currently, and what level of resources are you ready to invest? Rather than being intimidated by the complexity of DASH validation, we encourage you to see it as a gateway to the next generation of networking. The tools are maturing, the community is growing, and now is the perfect time to investigate and learn more about this transformative technology.
Key Takeaways for Success
- Embrace the Maturity Model: Successful DASH deployments are built on an incremental roadmap. Starting with Stage 1 Behavioral Models allows for catching logic errors early, significantly reducing the cost of hardware-level bugs.
- Decouple and Scale: By adopting frameworks that decouple the test logic from the underlying API (like SAI Challenger) and leverage declarative configuration tools (like dpugen), organizations can ensure their test suites remain maintainable even as they scale to millions of flows.
- Bridge to Hardware: The path to line-rate performance requires a "shift left" strategy. Using OTG-compliant tools ensures that the functional tests used during the software simulation phase can be seamlessly ported to high-speed hardware testers for final "Hero Test" validation.