PCIe and CXL Interconnects for AI Accelerators: Performance, Latency, and Telemetry
DOI:
https://doi.org/10.22399/ijcesen.4956
Keywords:
PCIe 6.0 PAM4 Interconnects, CXL 3.0 Fabric Architecture, Telemetry-Driven Runtime Optimization, AI Accelerator Latency Characterization, Credit-Aware Flow Control
Abstract
The rapid growth of artificial intelligence and high-performance computing workloads has shifted system performance bottlenecks from computational resources to interconnect infrastructure. Modern AI accelerators demand high bandwidth and predictable latency that challenge traditional interconnect technologies, particularly in heterogeneous computing environments where processors, accelerators, and memory expansion devices must communicate efficiently across complex fabric topologies. This article presents a unified framework for characterizing and optimizing PCIe 6.0 and CXL 3.0 interconnect fabrics, addressing challenges in latency predictability, throughput maximization, and operational observability. By modeling protocol stack behaviors, physical layer characteristics, and multi-level switching architectures, the article quantifies end-to-end latency contributors, including forward error correction overhead, credit-based flow control delays, and switch traversal costs. A telemetry-driven runtime framework integrates PCIe Advanced Error Reporting and CXL Fabric Manager interfaces to enable adaptive optimization policies: credit-aware scheduling, dynamic link management, memory tiering, and energy-efficient controller operation. Machine learning classifiers trained on historical telemetry data support predictive maintenance by identifying degrading links before service disruptions occur. Experimental validation across transformer training, large language model inference, and representative scientific computing kernels demonstrates substantial improvements in tail latency, aggregate throughput, and energy efficiency. The article provides practical guidance for fabric architects designing next-generation disaggregated computing infrastructures and identifies challenges and opportunities in scaling these approaches to hyperscale deployments.
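As a rough illustration of the latency contributors named in the abstract, the sketch below sums forward error correction, switch traversal, and wire delay into an end-to-end figure, and bounds throughput by the credit window (credits in flight divided by the credit return time). It is not the article's model: the additive form, the class and field names, and every numeric value are assumptions chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class PathModel:
    """Hypothetical per-path parameters; none of these values come from the article."""
    fec_ns: float             # FEC overhead added on the path
    switch_ns: float          # latency added by each switch hop
    hops: int                 # number of switches traversed
    wire_ns: float            # serialization plus propagation delay
    credits: int              # flow-control credits advertised by the receiver
    bytes_per_credit: int     # payload bytes covered by one credit
    credit_rtt_ns: float      # time for a consumed credit to be returned
    link_gbytes_per_s: float  # raw link bandwidth in GB/s


def end_to_end_latency_ns(p: PathModel) -> float:
    # Assumed additive model: each contributor adds independently.
    return p.fec_ns + p.hops * p.switch_ns + p.wire_ns


def credit_limited_gbytes_per_s(p: PathModel) -> float:
    # At most credits * bytes_per_credit can be in flight per credit round trip.
    # Bytes per nanosecond is numerically equal to GB/s.
    window = p.credits * p.bytes_per_credit / p.credit_rtt_ns
    return min(p.link_gbytes_per_s, window)


# Illustrative values only.
example = PathModel(fec_ns=2.0, switch_ns=100.0, hops=2, wire_ns=50.0,
                    credits=32, bytes_per_credit=16, credit_rtt_ns=64.0,
                    link_gbytes_per_s=128.0)
print(end_to_end_latency_ns(example))
print(credit_limited_gbytes_per_s(example))
```

With these made-up inputs the credit window, not the raw link rate, caps throughput, which is the kind of situation a credit-aware scheduling policy would aim to detect.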
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.